AI on text — Natural Language Processing Basics part 4 — Gensim Word2vec, T-SNE visualisation

Dharani J
4 min read · Aug 4, 2022

Made with https://www.iconfinder.com

In my previous blog, I explained how word2vec/GloVe create numerical vectors for a given text. Converting text to vectors has led to major breakthroughs in tasks like text classification, translation, question answering and text summarisation. Let's work through an example that uses the Gensim library in Python to obtain vectors for text.

We are going to learn:

  • Gensim word2vec parameters
  • Python example
  • PCA visualisation
  • T-SNE visualisation

Gensim is an open-source library that can be used to create word vectors (embeddings) using either the Continuous Bag of Words (CBOW) or the Skip-gram method discussed in the previous blog.
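To make the difference between the two methods concrete, here is a minimal sketch in plain Python of the training examples each one builds from a sentence. The function name `training_examples` and its `window` argument are made up for this illustration; the real implementation adds details (such as sampling) that this sketch leaves out.

```python
def training_examples(tokens, window=2):
    """Return (cbow, skipgram) training examples for one tokenised sentence."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        # words within `window` positions on either side of the target
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if not context:
            continue
        # CBOW: the surrounding words jointly predict the target
        cbow.append((context, target))
        # Skip-gram: the target predicts each surrounding word separately
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram


cbow, skipgram = training_examples(["apples", "are", "sweet"], window=1)
print(cbow)      # (context words, target) examples
print(skipgram)  # (target, context word) examples
```

CBOW produces one example per target word, while Skip-gram produces one example per (target, neighbour) pair, which is one reason Skip-gram is usually slower to train.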

In Gensim, there are a few parameters we can tune to obtain embeddings that suit our use case:

  • size (renamed vector_size in Gensim 4.0 and later): (default 100) The number of dimensions of the vector (embedding), i.e., each token (word) will be represented as a dense vector of length “size”. (100–500 is generally used; anything higher might lead to overfitting.)
  • window: (default 5) In CBOW or Skip-gram, the maximum distance between a target word and the words around it.
  • min_count: (default 5) The minimum count of words to consider when training the model; words that occur fewer times than this will be ignored. This helps remove rare words of little importance.
  • sg: (default 0, i.e. CBOW) The training algorithm: either CBOW (0) or Skip-gram (1).
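As a rough illustration of what min_count does, the sketch below counts tokens and keeps only those that reach the threshold. `build_vocab` and the toy sentences are hypothetical names and data for this example, not Gensim's own code.

```python
from collections import Counter


def build_vocab(tokenised_sentences, min_count=5):
    """Keep only the tokens that occur at least `min_count` times."""
    counts = Counter(tok for sent in tokenised_sentences for tok in sent)
    return {tok for tok, c in counts.items() if c >= min_count}


# toy data, for illustration only
toy = [
    ["fruits", "are", "sweet"],
    ["fruits", "are", "healthy"],
    ["bananas", "have", "potassium"],
]
vocab = build_vocab(toy, min_count=2)
print(sorted(vocab))
```

On a tiny corpus the default of 5 would throw away most words, which is why the example in this blog sets min_count=1.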

Before feeding text to a Gensim model, we must first tokenise it, i.e., turn each sentence into a list of words.

from gensim.models import Word2Vec

# define training data
sentences = [
    "Fruits are very important for our health.",
    "Fruits give us energy and increase the body's ability to fight diseases",
    "Eating fruits keeps our hair healthy",
    "Fruits are very important for good health",
    "There are some fruits - apples, oranges, grapes, bananas",
    "Apples are sweet",
    "Oranges are sour",
    "Bananas have potassium",
]
# lowercase each sentence and split it on spaces to get a list of tokens
# (punctuation stays attached, e.g. "health." and "apples,")
sent = [row.lower().split(' ') for row in sentences]
sent
Converting the sentences to list of tokens
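Splitting on spaces leaves punctuation attached, so "health." and "health" end up as two different tokens. If that matters for your data, one possible alternative is a small regex tokeniser. This is only a sketch; `tokenize` is a hypothetical helper and the pattern assumes plain English text.

```python
import re


def tokenize(sentence):
    # lowercase, then keep runs of letters, allowing one inner apostrophe (e.g. "body's")
    return re.findall(r"[a-z]+(?:'[a-z]+)?", sentence.lower())


print(tokenize("There are some fruits - apples, oranges, grapes, bananas"))
```

With this helper the token list could be built as `[tokenize(row) for row in sentences]` instead of using `split(' ')`.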
model = Word2Vec(sent, min_count=1)  # train word2vec, keeping every word
words = list(model.wv.key_to_index)  # vocabulary (Gensim 4.x; model.wv.vocab in 3.x)
model.wv['healthy']  # this will output the vector for the token specified
This is the vector/embedding of the word ‘healthy’ produced by word2vec
Most similar words to ‘apples’ in the text
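Gensim ranks "most similar" words by the cosine similarity between their vectors. The sketch below shows that idea in plain Python. The helper names and the 3-dimensional vectors are made up for illustration; a trained model would supply real 100-dimensional vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=2):
    """Rank every other word by its cosine similarity to `word`."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# made-up 3-dimensional vectors, for illustration only
toy_vectors = {
    "apples": [0.9, 0.1, 0.0],
    "oranges": [0.8, 0.2, 0.1],
    "potassium": [0.0, 0.1, 0.9],
}
print(most_similar("apples", toy_vectors))
```

In Gensim itself the equivalent lookup is `model.wv.most_similar('apples')`.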

The model can be saved and reused. In this way text is mapped to embeddings using Gensim.

#save model
model.save('model.bin')
# load model
new_model = Word2Vec.load('model.bin')

PCA Visualisation:

Using PCA (you can read my blog about PCA here), we can reduce the 100 dimensions to 2 and visualise them using pyplot.

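from sklearn.decomposition import PCA
from matplotlib import pyplot

# vocabulary and the matching matrix of word vectors
# (Gensim 4.x; 3.x used model.wv.vocab)
words = list(model.wv.key_to_index)
X = model.wv[words]
# project the 100-dimensional vectors onto 2 principal components
pca = PCA(n_components=2)
result = pca.fit_transform(X)
pyplot.scatter(result[:, 0], result[:, 1])
# label each point with its word
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()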
PCA dimension reduction and plotting the vectors of words
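For intuition about what the PCA step does, here is a minimal sketch that reduces vectors to 2 dimensions with NumPy's SVD. It assumes NumPy is installed; `pca_2d` is a hypothetical helper and the random matrix only stands in for real word vectors. scikit-learn's PCA may differ in details such as the sign of each component.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # centre each dimension at zero
    # rows of Vt are the principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T


rng = np.random.default_rng(0)
X = rng.normal(size=(10, 100))  # stand-in for 10 word vectors of length 100
result = pca_2d(X)
print(result.shape)
```

The first column of the result carries at least as much variance as the second, which is why it is normally plotted on the x-axis.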

These representations are trained on very few words, so it is easy to inspect all of them. In general we have a lot of words, and the model.wv.most_similar(word, topn=n) function in Gensim returns the top ‘n’ words nearest to a particular word in the whole corpus.

T-SNE Visualization:

t-Distributed Stochastic Neighbour Embedding (t-SNE) is a nonlinear, unsupervised dimension reduction technique used for data exploration and visualising high-dimensional data. It calculates a similarity measure between pairs of vectors in the high-dimensional space using a Gaussian distribution, and in the low-dimensional space using a Student t-distribution (similar to a Gaussian but with heavier tails). It then tries to make these two sets of similarities match by minimising a cost function: the difference between the probability distributions of the two spaces, measured with the Kullback-Leibler (KL) divergence. Gradient descent is used to minimise the KL cost function. t-SNE differs from PCA by preserving only small pairwise distances or local similarities, whereas PCA is concerned with preserving large pairwise distances to maximise variance. You can read about PCA in my blog here.
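The two similarity measures and the KL cost can be sketched in a few lines of plain Python. This is a heavily simplified illustration, not the full algorithm: real t-SNE picks a separate Gaussian width per point from the perplexity setting and normalises over all pairs of points. The function names and the distances below are made up for the example.

```python
import math


def gaussian_affinity(d, sigma=1.0):
    """Similarity used in the high-dimensional space."""
    return math.exp(-d ** 2 / (2 * sigma ** 2))


def student_t_affinity(d):
    """Similarity used in the low-dimensional space (one degree of freedom)."""
    return 1.0 / (1.0 + d ** 2)


def normalise(weights):
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(P || Q): zero when the two distributions agree, positive otherwise."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# made-up distances from one point to three neighbours, before and after embedding
high_dim_distances = [0.5, 1.0, 3.0]
low_dim_distances = [0.4, 1.2, 2.5]

P = normalise([gaussian_affinity(d) for d in high_dim_distances])
Q = normalise([student_t_affinity(d) for d in low_dim_distances])
cost = kl_divergence(P, Q)
print(cost)
```

At a distance of 3 the t-distribution still gives a noticeable similarity while the Gaussian is almost zero; this heavier tail is what lets distant points sit far apart in the 2-D plot without being penalised.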

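from sklearn.manifold import TSNE
from matplotlib import pyplot
import numpy as np

labels = []
tokens = []
# collect each word and its vector (Gensim 4.x; 3.x used model.wv.vocab)
for word in model.wv.key_to_index:
    tokens.append(model.wv[word])
    labels.append(word)
# recent scikit-learn versions need perplexity (default 30) to be smaller
# than the number of words being embedded
tsne_model = TSNE(n_components=2, random_state=1)
new_values = tsne_model.fit_transform(np.array(tokens))
x = []
y = []
for value in new_values:
    x.append(value[0])
    y.append(value[1])
pyplot.figure(figsize=(8, 6))
# plot each word and label it
for i in range(len(x)):
    pyplot.scatter(x[i], y[i])
    pyplot.annotate(labels[i],
                    xy=(x[i], y[i]),
                    xytext=(5, 2),
                    textcoords='offset points',
                    ha='right',
                    va='bottom')
pyplot.show()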
T-SNE representation of words

In this way we can play around with text-to-embedding conversion on the data we have. In general, if we have more contextual data for training, related words such as bananas, apples, grapes and oranges will end up very close together.

  • TF-IDF cannot capture the context of text, whereas word2vec can if it has good training data, as explained above.
  • We can also use pretrained embeddings like the Google News embeddings or GloVe embeddings. They cannot handle out-of-vocabulary words.
  • Out-of-vocabulary words can be handled using the fastText library built by Facebook. It is highly efficient in creating word embeddings.
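fastText copes with out-of-vocabulary words by representing a word through its character n-grams, so an unseen word still shares pieces with words seen during training. The sketch below only shows how such n-grams can be extracted; `char_ngrams` is a hypothetical helper, and the 3 to 6 character range with `<` and `>` boundary markers is assumed to match fastText's defaults.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, with < and > marking its boundaries."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


print(char_ngrams("apple", 3, 3))

# an unseen word still overlaps with a word from the training text
shared = set(char_ngrams("grapes")) & set(char_ngrams("grapefruit"))
print(sorted(shared))
```

Because "grapefruit" shares n-grams such as "grape" with "grapes", a model that learns vectors for n-grams can still build a reasonable vector for it.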

Check my previous blog on Word2Vec/Glove

Hope you learnt something new today from my blog. Thank You :)

Written by Dharani J

Sr. Data Scientist | NLP — ML Blogger
