AI on Text — Natural Language Processing Part 1 — Preprocessing Text

Dharani J
8 min read · Aug 18, 2021

Introduction

Most of us have at some point been amazed by applications like Siri, Alexa, Cortana and interactive chatbots. These can perform tasks like answering our questions, starting a call, setting reminders, helping with navigation etc., just by recognizing our voice. What they all have in common is the ability to process speech/text and produce the desired results. NLP is the study of how a computer manipulates and understands natural languages (English, Spanish, French etc.), which, unlike programming languages, do not have explicit rules and may have dialects.

Siri answering a question we ask

A few of the ingenious applications of NLP include text summarization, language translation, speech recognition, entity recognition, question answering, autocomplete etc. These applications are then integrated with smart devices, IoT etc. Due to the high demand for such applications and breakthrough innovations, NLP is gaining widespread recognition and interest in AI.

The first key step in NLP is converting all the text obtained into a numeric format which the machine can then process further for complex tasks. To do this, we need to clean the text a bit, as not all the words we speak or write are important. For example, in the sentence “John works in a leading IT firm as a Project Lead”, the important words are “John”, “works”, “IT firm” and “Project Lead”. Pronouns, prepositions etc. can be excluded. There are a few more important transformations we need to perform before converting the text into vectors of numbers, which we will look into in the following sections.

NLTK Introduction:

NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with natural language text. It provides easy-to-use interfaces to many corpora and lexical resources. Also, it contains a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. More than anything, NLTK is a free, open source, community-driven project.

We’ll use this toolkit to show some basics of the natural language processing field. For this we will have to import the nltk library.
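A minimal setup sketch (the resource names below are the standard NLTK downloads for the steps covered in this post):

```python
# pip install nltk
import nltk

# One-time downloads for the resources used in this post.
nltk.download("punkt")      # sentence/word tokenizer models
nltk.download("stopwords")  # common stop-word lists
nltk.download("wordnet")    # dictionary backing the lemmatizer
```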

Basic transformations on NLP text:

In this section we will understand the following topics:

  1. Tokenization
  2. Stop words
  3. Regex
  4. Stemming and Lemmatization

1. Tokenization

Tokenization is breaking down paragraphs into sequences of sentences, or sentences into sequences of words. NLTK provides two tokenizers: a sentence tokenizer and a word tokenizer. Let’s consider a paragraph from the Wikipedia page on ice cream and observe how tokenization works on this text.

“Ice cream (derived from earlier iced cream or cream ice)[1] is a sweetened frozen food typically eaten as a snack or dessert. It may be made from dairy milk or cream and is flavoured with a sweetener, either sugar or an alternative, and any spice, such as cocoa or vanilla. It can also be made by whisking a flavored cream base and liquid Nitrogen together”

Sentence tokenizer on this text will produce the following output by splitting the paragraph into 3 sentences.

Ice cream (derived from earlier iced cream or cream ice)[1] is a sweetened frozen food typically eaten as a snack or dessert.
It may be made from dairy milk or cream and is flavoured with a sweetener, either sugar or an alternative, and any spice, such as cocoa or vanilla.
It can also be made by whisking a flavored cream base and liquid Nitrogen together.

The word tokenizer will divide each sentence into words. The sentence tokenizer has already divided the paragraph into sentences, so let’s see how the word tokenizer works on these sentences.

['Ice', 'cream', '(', 'derived', 'from', 'earlier', 'iced', 'cream', 'or', 'cream', 'ice', ')', '[', '1', ']', 'is', 'a', 'sweetened', 'frozen', 'food', 'typically', 'eaten', 'as', 'a', 'snack', 'or', 'dessert', '.']
['It', 'may', 'be', 'made', 'from', 'dairy', 'milk', 'or', 'cream', 'and', 'is', 'flavoured', 'with', 'a', 'sweetener', ',', 'either', 'sugar', 'or', 'an', 'alternative', ',', 'and', 'any', 'spice', ',', 'such', 'as', 'cocoa', 'or', 'vanilla', '.']
['It', 'can', 'also', 'be', 'made', 'by', 'whisking', 'a', 'flavored', 'cream', 'base', 'and', 'liquid', 'Nitrogen', 'together', '.']

2. Stop Words

As discussed previously, not all the words in a sentence are relevant for text processing, and removing them helps reduce noise. Stop words are words which, upon removal from a sentence, do not change its crux. NLTK has a predefined list of stop words which are very common in sentences. Let’s look at how to use it for text cleaning.

This results in all the stop words defined for English, which are:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Let’s apply this to the ice-cream paragraph and check the output.

['Ice', 'cream', '(', 'derived', 'earlier', 'iced', 'cream', 'cream', 'ice', ')', '[', '1', ']', 'sweetened', 'frozen', 'food', 'typically', 'eaten', 'snack', 'dessert', '.', 'It', 'may', 'made', 'dairy', 'milk', 'cream', 'flavoured', 'sweetener', ',', 'either', 'sugar', 'alternative', ',', 'spice', ',', 'cocoa', 'vanilla', '.', 'It', 'also', 'made', 'whisking', 'flavored', 'cream', 'base', 'liquid', 'Nitrogen', 'together']

This has produced a list of words from the text without stop words, which has removed a lot of noise. But we can still see some noise tokens like “(”, “]”, “,” etc. that are not needed in processing as they do not add any meaning. It’s good practice to filter them out before we proceed to further steps. Python provides a well-defined module for this kind of text filtering: “re”.

3. Regex:

A regular expression, or regex, is used when we want to match a specific pattern in text. A lot of the time we will not be interested in punctuation or unwanted symbols in a text, and there can be many requirements to filter text based on the problem statement. Given below are a few important patterns which are widely used.

  • . - match any character except newline
  • \w - match a word character
  • \W - match a non-word character
  • \d - match a digit
  • \D - match a non-digit
  • \s - match a whitespace character
  • \S - match a non-whitespace character
  • [abc] - match any of a, b, or c
  • [^abc] - match any character except a, b, or c
  • [0-9] - match a digit between 0 and 9

As we saw above, our text has a few tokens which can be considered noise; we will now filter them out using regex. We want to remove non-word characters, which is easy and computationally fast with regex. Let’s check the implementation. Source: https://docs.python.org/3/library/re.html?highlight=regex
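One way to do this with the standard-library re module, replacing each non-word character with a single space (which matches the double spacing visible in the output that follows):

```python
import re

text = ("Ice cream (derived from earlier iced cream or cream ice)[1] is a "
        "sweetened frozen food typically eaten as a snack or dessert. It may be "
        "made from dairy milk or cream and is flavoured with a sweetener, either "
        "sugar or an alternative, and any spice, such as cocoa or vanilla. It can "
        "also be made by whisking a flavored cream base and liquid Nitrogen together")

# \W matches any character other than [a-zA-Z0-9_];
# each match is replaced by one space.
clean_text = re.sub(r"\W", " ", text)
print(clean_text)
```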

re.sub will replace the non-word characters with a space (we can change this as per requirements). This is how the output looks:

Ice cream  derived from earlier iced cream or cream ice  1  is a sweetened frozen food typically eaten as a snack or dessert   It may be made from dairy milk or cream and is flavoured with a sweetener  either sugar or an alternative  and any spice  such as cocoa or vanilla   It can also be made by whisking a flavored cream base and liquid Nitrogen together

Applying tokenization now will remove all the noise that we observed previously. It is best practice to first filter the text using regex and then perform tokenization.

4. Stemming and Lemmatization:

In the text above, “Ice cream derived from earlier iced cream…”, we see “derived”, which is the past form of “derive”. Rather than keeping past and future tenses or comparative and superlative forms of a word, it is enough to have the original word in the processed text. One reason is that, since the text will be converted into vectors of numbers in later stages, “derive” and “derived” have the same meaning but would generate two different vectors if we do not normalize them. For this reason we have two techniques, stemming and lemmatization, to reduce inflectional forms of words and conserve their base form.

Simply put, just like we normalize data of numbers, we normalize text using stemming and lemmatization.

Stemming usually chops off the suffixes of verbs (‘ed’), adverbs (‘ly’, ‘lly’), plurals (‘es’, ‘s’) etc. It does not consider the context of the word. For example, “chairs” is stemmed to “chair” and “roses” to “rose”. It is easier to implement and runs faster.
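A quick sketch with NLTK’s PorterStemmer (one of several stemmers NLTK provides):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("chairs"))   # chair
print(stemmer.stem("roses"))    # rose
# Stems are not guaranteed to be dictionary words:
print(stemmer.stem("derived"))  # deriv
```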

Lemmatization derives the base form of a word as found in the dictionary. This base word is called the lemma. For example, “better” is lemmatized to “good” and “saw” to “see”. The context of the word is preserved.

Let us take first sentence from our text and check the output of stemming and lemmatization.

Output:

Ice cream  derived from earlier iced cream or cream ice  1  is a sweetened frozen food typically eaten as a snack or dessert

Actual word: Ice
Stem: ice
Lemma: Ice
Actual word: cream
Stem: cream
Lemma: cream
Actual word: derived
Stem: deriv
Lemma: derive
Actual word: earlier
Stem: earlier
Lemma: earlier
Actual word: iced
Stem: ice
Lemma: ice
Actual word: cream
Stem: cream
Lemma: cream
Actual word: cream
Stem: cream
Lemma: cream
Actual word: ice
Stem: ice
Lemma: ice
Actual word: 1
Stem: 1
Lemma: 1
Actual word: sweetened
Stem: sweeten
Lemma: sweeten
Actual word: frozen
Stem: frozen
Lemma: freeze
Actual word: food
Stem: food
Lemma: food
Actual word: typically
Stem: typic
Lemma: typically
Actual word: eaten
Stem: eaten
Lemma: eat
Actual word: snack
Stem: snack
Lemma: snack
Actual word: dessert
Stem: dessert
Lemma: dessert

This is the most widely used process for cleaning text. The processed text will now be used to generate a vector of numbers for each word, which is called feature extraction. There are a few techniques for this, and we will deep dive into them in the next blog.

Hope this blog gave you enough curiosity to get started with NLP and the techniques used for text cleaning.

Check out my next blog here to understand text to feature creation (Bag of Words, TF-IDF) in detail.
