NLP with Tensorflow — Tokenizing and Sequencing the sentences

Krishna Kankipati
May 30, 2021
2 min read

Updated: May 31, 2021

When we are dealing with images, it is easy for us to feed them into a neural network, as the pixel values were already numbers. But what happens with text? How can we do that with sentences and words?

In this part of the NLP series, we will learn how to build models that understand text that are trained on labeled text and then classify the new text based on what they have seen.

Well, I think you got an idea, why can’t we encode each character(character encodings). Well let’s take a look at an example…

R E A D — 82 69 65 68

D E A R — 68 69 65 82

Here the characters are encoded into their ASCII values. Does this provide the semantics of the word? READ and DEAR are two different words with the same characters!

Okay, what if we encoded each word and used those values to feed into a neural network?

I love to eat bananas — 1 2 3 4 5, each word is encoded into a value

I love to eat grapes — 1 2 3 4 6, since the first three words are encoded earlier we are encoding the last word with another value. That is we have created a new token for ‘grapes’!

This could help us build a neural network model based on words.

How can we start training a neural network based on words? Simple, by using Tensorflow and Keras APIs. Look at the code below to tokenize sentences…

Now let’s add another sentence and see what the Tokenizer does.

Cool! We built a dictionary of all words to make a corpus. Now we need to turn your sentences into lists of values based on these tokens.

The ‘text_to_sequences’ call can take any set of sentences, so it can encode them based on the word set that it learned from the one that was passed into ‘fit_on_texts’.

If you train a neural network on a corpus of texts, and the text has a word index generated from it, then when you want to do inference with the train model, you’ll have to encode the text that you want to infer on with the same word index, otherwise it would be meaningless.

We can avoid this by a large set of training data to get a broad vocabulary so that we can’t miss the words. We can make it simple by using a value instead of ignoring an unseen word!

Have you observed something? The neural network takes an input array of the same length. The sentences in the dataset maynot be of the same number of words. The arrays in a sequence have different lengths.

Is there a way to handle this? Yes, it’s Padding!

In the upcoming articles we will discuss Padding and many more…

Data OilSt.

NLP with Tensorflow — Tokenizing and Sequencing the sentences

Related Posts

Comments

Any Querries?

DataOil St.