Sarcasm dataset - Tokenizing, Sequencing and Padding
In the previous articles we discussed tokenizing, sequencing and padding sentences. Now we will apply those methods to a real dataset.
We will use the News Headlines Dataset For Sarcasm Detection.
Each record consists of three attributes:
1. is_sarcastic: 1 if the record is sarcastic, otherwise 0
2. headline: the headline of the news article
3. article_link: link to the original news article, useful for collecting supplementary data
To know more about the dataset, see its page on Kaggle.
Now let us see how to apply the methods we have learned:
1. Loading the dataset and creating 3 lists to store the ‘headline’, ‘is_sarcastic’ and ‘article_link’ info from each data point.
2. Tokenizing, Sequencing and Padding the sentences list.
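The two steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not necessarily the code that produced the results in this article. The file is assumed to be either a single JSON array or one JSON object per line. The Keras `Tokenizer` and `pad_sequences` utilities from TensorFlow are assumed to be the tools covered in the previous articles. The `oov_token` and `padding="post"` settings are illustrative choices, and the function names are made up for this example.

```python
import json


def load_sarcasm_records(path):
    """Return (sentences, labels, urls) read from a sarcasm headlines JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read().strip()

    # Assumption: the file is either one JSON array or one JSON object per line.
    if text.startswith("["):
        records = json.loads(text)
    else:
        records = [json.loads(line) for line in text.splitlines() if line.strip()]

    # One list per attribute, kept in the same order as the records.
    sentences = [record["headline"] for record in records]
    labels = [record["is_sarcastic"] for record in records]
    urls = [record["article_link"] for record in records]
    return sentences, labels, urls


def tokenize_and_pad(sentences):
    """Fit a Keras Tokenizer on the sentences and return (word_index, padded)."""
    # Imported here so the loading step above works without TensorFlow installed.
    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    # Assumption: reserve an out-of-vocabulary token for unseen words.
    tokenizer = Tokenizer(oov_token="<OOV>")
    tokenizer.fit_on_texts(sentences)

    # Replace every word with its integer index from word_index.
    sequences = tokenizer.texts_to_sequences(sentences)

    # No maxlen is given, so every row is padded to the longest sequence.
    padded = pad_sequences(sequences, padding="post")
    return tokenizer.word_index, padded
```

A typical call would be `sentences, labels, urls = load_sarcasm_records("sarcasm.json")` followed by `word_index, padded = tokenize_and_pad(sentences)`, where `sarcasm.json` is a hypothetical local copy of the dataset. Printing `len(word_index)` and `padded.shape` then gives the vocabulary size and the padded dimensions.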
The length of word_index is 29657. Each padded sentence has size 40, i.e. the longest sequence has length 40 and every shorter sentence is padded up to that length.
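To see why the padded size equals the length of the longest sequence, here is a small dependency-free illustration of padding. It only sketches the idea and is not the library's implementation. The helper `pad_to_longest` and the toy sequences are hypothetical, and zeros are appended at the end here (placing them at the front would give the same width).

```python
def pad_to_longest(sequences, pad_value=0):
    """Right-pad each list of token ids with pad_value up to the longest length."""
    max_len = max((len(seq) for seq in sequences), default=0)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]


# Three toy sequences of lengths 2, 4 and 1; the longest one sets the width.
toy_sequences = [[4, 2], [7, 1, 9, 3], [5]]
padded = pad_to_longest(toy_sequences)

for row in padded:
    print(row)
```

Every row comes out with four entries because the longest toy sequence has four tokens, in the same way that the longest headline sequence of 40 tokens sets the width for the real data.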