AI on Text — Natural Language Processing Basics Part 2 — Bag of Words and TF-IDF
In the previous blog we saw how to clean text; now we will jump to feature extraction, i.e., converting text into vectors of numbers that a machine can read. Let us focus on a few important feature extraction techniques.
If we want to predict house prices, we need certain features related to each house: locality, area of the house, distance from schools and hospitals, material used in construction, etc. These features can be numbers or categories (which are later converted to numbers using label or one-hot encoding). Given these features, we apply an ML/DL algorithm which tries to reduce the error between its prediction and the actual value. Throughout this process, the algorithms work purely with numbers in all their calculations.
Consider the famous problem of spam detection. Here we only have the text of emails, so what features can we extract from this data? We have to convert the text into features: vectors of numbers that any algorithm can easily process. This process of transforming text into features is called feature extraction.
There are a few fundamental techniques for feature extraction:
- Bag of words
- TF-IDF
1. Bag of Words:
In the bag of words approach, we discard the order in which words appear in the input text and consider only the set of words it contains. This model converts the text into a set of words and counts the frequency of occurrence of each word.
The raw text is the input corpus for the analysis. Cleaning and tokenizing converts it into separate words; ways to clean and tokenize were discussed in the previous blog.
Vocab building is nothing but creating a set of all the unique words obtained from tokenization. Then we create vectors using each word's frequency of occurrence.
For Example:
Consider 3 sentences:
- John drove his car to Paris.
- The flight was late.
- The flight was full. Traveling by flight is expensive.
Let's check the unique words across the 3 sentences, which form our vocab.
Combining all the unique words from the sentences gives us the vocab, or the features, of the text corpus.
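After lowercasing, the combined vocab from the three sentences looks like this (a total of 15 features):

by, car, drove, expensive, flight, full, his, is, john, late, paris, the, to, traveling, was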
Python Implementation:
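A minimal sketch using scikit-learn's CountVectorizer (the exact code may vary, but the idea is the same):

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "John drove his car to Paris.",
    "The flight was late.",
    "The flight was full. Traveling by flight is expensive."
]

# CountVectorizer lowercases and tokenizes the sentences,
# builds the vocab, and counts each word's occurrences per sentence
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # the vocab / features
print(bow_matrix.toarray())                # one count vector per sentence
```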
The results are:
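With the sketch above, scikit-learn prints the vocab followed by one count vector per sentence:

```
['by' 'car' 'drove' 'expensive' 'flight' 'full' 'his' 'is' 'john' 'late'
 'paris' 'the' 'to' 'traveling' 'was']
[[0 1 1 0 0 0 1 0 1 0 1 0 1 0 0]
 [0 0 0 0 1 0 0 0 0 1 0 1 0 0 1]
 [1 0 0 1 2 1 0 1 0 0 0 1 0 1 1]]
```

Note how the word 'flight' gets a count of 2 in the third sentence.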
The BoW approach is mainly used in NLP for document classification and information retrieval tasks.
There are a few limitations to this approach:
- Since the order of words is not considered, the context or semantic meaning of the sentence is lost
- With long sentences, the vocab grows large, leading to high-dimensional, very sparse (mostly zeroes) vectors and higher computational cost
Let's look at another approach called TF-IDF vectorization.
2. Term Frequency — Inverse Document Frequency (TF-IDF):
The TF-IDF method is similar to bag of words, but it penalizes unimportant words in a document/sentence/corpus and gives more weight to valuable words. Let's look at how this model works.
The basic idea is that a word appearing many times in a document is likely important to that document. At the same time, if that word also appears in many other documents in the corpus, its importance is reduced.
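Concretely, one common textbook formulation is (scikit-learn's TfidfVectorizer uses a smoothed, normalized variant of this):

TF(t, d) = (number of times term t occurs in document d) / (total number of terms in d)

IDF(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t

TF-IDF(t, d) = TF(t, d) × IDF(t)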
For example:
Take the same 3 sentences we used for bag of words and let's understand how the scores are calculated.
Term Frequency (TF): the frequency of occurrence of a word in a sentence, normalized by the sentence's length.
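With that definition (assuming simple tokenization after removing punctuation, the three sentences have 6, 4, and 9 tokens), a few TF scores are:

- TF('flight') = 0 in sentence 1, 1/4 = 0.25 in sentence 2, 2/9 ≈ 0.22 in sentence 3
- TF('the') = TF('was') = 0 in sentence 1, 1/4 = 0.25 in sentence 2, 1/9 ≈ 0.11 in sentence 3
- TF('john') = 1/6 ≈ 0.17 in sentence 1, 0 elsewhere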
Inverse Document Frequency Scores:
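Using IDF(t) = log(N / df(t)) with N = 3 documents and the natural log:

- 'the', 'flight', and 'was' appear in 2 of the 3 sentences: IDF = log(3/2) ≈ 0.41
- every other word appears in only 1 sentence: IDF = log(3/1) ≈ 1.10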
Final TF-IDF Scoring:
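Multiplying TF by IDF gives the final scores; a few examples under the same assumptions:

- 'the' in sentence 2: 0.25 × 0.41 ≈ 0.10, and in sentence 3: 0.11 × 0.41 ≈ 0.05
- 'flight' in sentence 3: 0.22 × 0.41 ≈ 0.09, the highest score among the words shared across sentences
- 'late' in sentence 2: 0.25 × 1.10 ≈ 0.27, high because it appears in only one sentence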
From the scores above we can clearly see that the words 'the' and 'was' are penalized, as they are repeated in the majority of the documents, while the word 'flight' in the 3rd sentence is valuable because its term frequency is high. Although context is still not preserved, information retrieval is more meaningful using TF-IDF than BoW.
We can implement all of these calculations with sklearn. Final code:
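A minimal sketch using scikit-learn's TfidfVectorizer (note that sklearn applies a smoothed IDF and L2-normalizes each row, so its numbers differ slightly from the hand calculation above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "John drove his car to Paris.",
    "The flight was late.",
    "The flight was full. Traveling by flight is expensive."
]

# TfidfVectorizer combines counting with TF-IDF weighting:
# it computes term frequencies, multiplies them by (smoothed) IDF
# weights, and L2-normalizes each sentence vector
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))
```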
The outputs are:
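With the sketch above, the scores come out roughly as (rounded to two decimals):

```
['by' 'car' 'drove' 'expensive' 'flight' 'full' 'his' 'is' 'john' 'late'
 'paris' 'the' 'to' 'traveling' 'was']
[[0.   0.41 0.41 0.   0.   0.   0.41 0.   0.41 0.   0.41 0.   0.41 0.   0.  ]
 [0.   0.   0.   0.   0.46 0.   0.   0.   0.   0.6  0.   0.46 0.   0.   0.46]
 [0.34 0.   0.   0.34 0.52 0.34 0.   0.34 0.   0.   0.   0.26 0.   0.34 0.26]]
```

Again, 'flight' in the third sentence gets the highest weight (0.52), while the common words 'the' and 'was' get the lowest (0.26).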
We have now converted text to feature vectors, which we can process for our designed tasks. The rule of thumb is: the higher the TF-IDF score, the rarer, more unique, and more valuable the term, and vice versa. Hope you understood the basic difference between BoW and TF-IDF and how to create numerical vectors from text. In the next blogs we will play around with a few intriguing ideas that have revolutionized NLP.
Check out my previous blog on text processing and my next blog on Word2Vec/GloVe.
Thank You! :)