AI on text: Natural Language Processing Basics, part 3: Word2Vec/GloVe
In previous blogs we saw how to create vectors from text and understood their limitations: they are sparse representations that consume a lot of memory, and they preserve no context. To overcome these limitations we have a simple yet powerful concept called word embeddings. To give a flavour of what we will be learning, consider two statements:
- I like travelling by aeroplane
- I like travelling in flight.
The two statements have the same meaning, but we get different vector representations in the BoW and TF-IDF settings. We saw that these methods create very sparse vectors (many zeros and few non-zero values in the matrix), which are computationally heavy, and the context in which a word is used is completely lost. With word embeddings, we can clearly conclude that the above two sentences are very similar, as the words ‘aeroplane’ and ‘flight’ will have similar embedding vectors. It is really important to preserve the semantics or context of a word in the sentence.
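To make this concrete, here is a minimal sketch in plain Python that builds BoW count vectors for the two sentences and compares them with cosine similarity. The whitespace tokeniser and the helper names are my own assumptions, not part of any library.

```python
import math


def bow_vector(sentence, vocab):
    """Count how often each vocabulary word appears in the sentence."""
    tokens = sentence.lower().split()
    return [tokens.count(word) for word in vocab]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


s1 = "I like travelling by aeroplane"
s2 = "I like travelling in flight"

# Shared vocabulary over both sentences, sorted for a stable column order
vocab = sorted(set(s1.lower().split()) | set(s2.lower().split()))

v1 = bow_vector(s1, vocab)
v2 = bow_vector(s2, vocab)

print(vocab)
print(v1)
print(v2)
print(cosine(v1, v2))
```

The two vectors overlap only on the words the sentences literally share. ‘aeroplane’ and ‘flight’ sit in separate dimensions, so BoW has no way of knowing they are related.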
A very popular example is that the embedding of ‘king’ - ‘man’ + ‘woman’ results in a vector very close to the embedding of ‘queen’. A good analogy is that we represent words as numerical vectors just like we represent colors in RGB format. I am sure this has given you enough curiosity to keep reading till the end of the blog.
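Here is a toy sketch of that vector arithmetic. The 2-dimensional embeddings below are hand-made for illustration (think of the axes loosely as "royalty" and "gender"); they are not real word2vec vectors, and the `analogy` helper is hypothetical.

```python
import math

# Hand-made toy embeddings, invented for illustration only
embeddings = {
    "king":   [0.9, 0.8],
    "queen":  [0.9, -0.8],
    "man":    [0.1, 0.9],
    "woman":  [0.1, -0.9],
    "prince": [0.7, 0.7],
    "apple":  [-0.8, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    query = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [word for word in embeddings if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(query, embeddings[word]))


result = analogy("king", "man", "woman")
print(result)
```

Real embeddings have many more dimensions and the match is only approximate, which is why the nearest neighbor by cosine similarity is used rather than an exact equality.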
Word2Vec:
This is one of the most popular methods to create word embeddings. Due to its fast computation and open source availability it is very commonly used. We create these embeddings by training a shallow neural network. There are two main architectures for creating word2vec embeddings. The first is CBOW, or the continuous bag of words model. In this setting, we try to predict a word given its neighbors (its context), and the embedding vectors are learned along the way. We can pick the window size, which is the number of neighboring words on each side that we want to consider for training.
Let us observe its working with an example:
If we want to predict a word given its neighboring words, the main parameter to adjust for the CBOW model is its window size. For the sentence “Word embeddings preserve semantic context of the words”, if the window size is 2, then the 2 left and 2 right neighboring words are fed as input to the training model (X) and the word in the middle is our target (Y). Basically, we use some surrounding words to predict the middle word.
Then we create context pairs, which in this example are ([‘word’, ‘embeddings’, ‘semantic’, ‘context’], ‘preserve’), ([‘embeddings’, ‘preserve’, ‘context’, ‘of’], ‘semantic’), ([‘preserve’, ‘semantic’, ‘of’, ‘the’], ‘context’), ([‘semantic’, ‘context’, ‘the’, ‘words’], ‘of’). With these pairs, the model tries to predict the target word given the context words. For example, X = [‘word’, ‘embeddings’, ‘semantic’, ‘context’] and Y is ‘preserve’. We first obtain the vocabulary, which is [‘word’, ‘embeddings’, ‘preserve’, ‘semantic’, ‘context’, ‘of’, ‘the’, ‘words’], and then one hot encode the words to give as input.
Wt-2(‘word’) will be [1,0,0,0,0,0,0,0], Wt-1(‘embeddings’) = [0,1,0,0,0,0,0,0], Wt+1(‘semantic’) = [0,0,0,1,0,0,0,0], Wt+2(‘context’) = [0,0,0,0,1,0,0,0].
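The pair generation and one hot encoding can be sketched in a few lines of plain Python. The function names `cbow_pairs` and `one_hot` are my own, and I am assuming lowercased, whitespace-split tokens and only positions that have a full window on both sides.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for positions with a full window."""
    pairs = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        pairs.append((context, tokens[i]))
    return pairs


def one_hot(word, vocab):
    """Return a list of 0s with a single 1 at the word's position in vocab."""
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector


sentence = "Word embeddings preserve semantic context of the words"
tokens = sentence.lower().split()

# Vocabulary in order of first appearance (dict.fromkeys drops duplicates, keeps order)
vocab = list(dict.fromkeys(tokens))

pairs = cbow_pairs(tokens, window=2)
for context, target in pairs:
    print(context, "->", target)

# One hot encode the context of the first pair, whose target is 'preserve'
first_context, first_target = pairs[0]
encoded = [one_hot(word, vocab) for word in first_context]
for word, vector in zip(first_context, encoded):
    print(word, vector)
```

Each context word ends up with a 1 at its own vocabulary index, so no two different words share the same one hot vector.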
Let’s say we have a vocabulary of N words and, as we saw above, 4 context words used for predicting one target word. The input layer will then take four 1xN one hot encoded vectors like the ones above. Each context vector passes through an NxV embedding matrix (initialised with random weights) to give a dense vector of dimension V, so each word has its own V-sized dense vector. These vectors are then passed to a lambda layer, where they are summed or averaged into a single V-sized vector. The output from the lambda layer is fed into a final dense layer with a softmax function, which predicts the probability of each word in the vocabulary being the target word.
The values in the matrix start out random and are updated via back propagation. Over many passes, the weights in our embedding matrix are adjusted to minimize the error made when predicting the target word.
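A minimal sketch of this forward pass in plain Python, assuming a vocabulary of 8 words, an embedding dimension of 3, and small random initial weights. The variable and function names are my own, and a real implementation would use a numerical library rather than nested lists.

```python
import math
import random

random.seed(0)
vocab_size, embed_dim = 8, 3  # N and V in the text

# Embedding matrix (N x V) and output matrix (V x N), both randomly initialised
embedding = [[random.uniform(-0.5, 0.5) for _ in range(embed_dim)] for _ in range(vocab_size)]
output_weights = [[random.uniform(-0.5, 0.5) for _ in range(vocab_size)] for _ in range(embed_dim)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cbow_forward(context_ids):
    """Average the context embeddings and score every vocabulary word."""
    # Multiplying a one hot vector by the embedding matrix just selects one row
    vectors = [embedding[i] for i in context_ids]
    hidden = [sum(column) / len(vectors) for column in zip(*vectors)]
    scores = [
        sum(hidden[d] * output_weights[d][j] for d in range(embed_dim))
        for j in range(vocab_size)
    ]
    return softmax(scores)


# Context ['word', 'embeddings', 'semantic', 'context'] has vocabulary indices 0, 1, 3, 4
probs = cbow_forward([0, 1, 3, 4])
print(probs)
```

With untrained weights the probabilities are close to uniform; training is what pushes the probability of the true target word up.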
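To show what "updated via back propagation" means here, the sketch below repeats a single CBOW training step on one (context, target) pair using a full softmax and plain gradient descent. The learning rate, sizes, and names are my own assumptions, chosen only to keep the example small.

```python
import math
import random

random.seed(0)
vocab_size, embed_dim = 8, 3

embedding = [[random.uniform(-0.5, 0.5) for _ in range(embed_dim)] for _ in range(vocab_size)]
output_weights = [[random.uniform(-0.5, 0.5) for _ in range(vocab_size)] for _ in range(embed_dim)]


def softmax(scores):
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(context_ids, target_id, lr=0.1):
    """One forward and backward pass; returns the loss measured before the update."""
    vectors = [embedding[i] for i in context_ids]
    hidden = [sum(column) / len(vectors) for column in zip(*vectors)]
    scores = [
        sum(hidden[d] * output_weights[d][j] for d in range(embed_dim))
        for j in range(vocab_size)
    ]
    probs = softmax(scores)
    loss = -math.log(probs[target_id])  # cross-entropy for the true target

    # Gradient of the loss with respect to the scores: predicted minus one hot target
    error = [p - (1.0 if j == target_id else 0.0) for j, p in enumerate(probs)]

    # Gradient flowing back into the averaged hidden vector (uses the old output weights)
    grad_hidden = [
        sum(output_weights[d][j] * error[j] for j in range(vocab_size))
        for d in range(embed_dim)
    ]

    # Update the output weights
    for d in range(embed_dim):
        for j in range(vocab_size):
            output_weights[d][j] -= lr * hidden[d] * error[j]

    # Each context embedding receives an equal share of the hidden gradient
    for i in context_ids:
        for d in range(embed_dim):
            embedding[i][d] -= lr * grad_hidden[d] / len(context_ids)

    return loss


# Context indices 0, 1, 3, 4 predicting target index 2 ('preserve')
losses = [train_step([0, 1, 3, 4], 2) for _ in range(200)]
print(losses[0], losses[-1])
```

Only the rows of the embedding matrix that belong to the context words change in a step, while every column of the output matrix is touched.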
One problem we encounter is the cost of each update. With a full softmax, every training sample involves the output weights of every word in the vocabulary, yet only the weights tied to the target word change meaningfully. The weights corresponding to non-target words receive a marginal change or none at all, so each pass does a lot of computation for what is effectively a very sparse update. To overcome this, we can use negative sampling.
Negative sampling modifies only a small percentage of the weights for each training sample, rather than all of them. We do this by slightly modifying our problem. Instead of predicting, for every word in the vocabulary, the probability of it being a nearby word, we predict whether a given pair of words are neighbors or not, using the true pair plus a few randomly sampled ‘negative’ words. This reduces the computational expense. Previously the softmax had to assign a probability to every word (every class). Now the model only makes a neighbor-or-not decision for the handful of sampled words.
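A minimal sketch of one negative sampling update, written for a single (center word, neighbor word) pair. It assumes negatives are drawn uniformly at random, which is a simplification (the original word2vec draws them from a smoothed word-frequency distribution), and all names and sizes are my own.

```python
import math
import random

random.seed(1)
vocab_size, embed_dim = 8, 3

input_vectors = [[random.uniform(-0.5, 0.5) for _ in range(embed_dim)] for _ in range(vocab_size)]
output_vectors = [[random.uniform(-0.5, 0.5) for _ in range(embed_dim)] for _ in range(vocab_size)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_step(center_id, context_id, num_negatives=2, lr=0.1):
    """Update vectors for one true pair plus a few sampled non-neighbors."""
    candidates = [i for i in range(vocab_size) if i not in (center_id, context_id)]
    negatives = random.sample(candidates, num_negatives)

    center = input_vectors[center_id][:]  # copy, so all gradients use the old values
    grad_center = [0.0] * embed_dim
    loss = 0.0
    touched = []

    # Label 1 for the real neighbor, label 0 for each sampled negative word
    samples = [(context_id, 1.0)] + [(n, 0.0) for n in negatives]
    for word_id, label in samples:
        out = output_vectors[word_id]
        pred = sigmoid(dot(center, out))
        loss += -math.log(pred) if label == 1.0 else -math.log(1.0 - pred)
        error = pred - label
        for d in range(embed_dim):
            grad_center[d] += error * out[d]  # uses out[d] before it is updated
            out[d] -= lr * error * center[d]
        touched.append(word_id)

    for d in range(embed_dim):
        input_vectors[center_id][d] -= lr * grad_center[d]

    return loss, touched


loss, touched = negative_sampling_step(center_id=2, context_id=3)
print(loss, touched)
```

Only `num_negatives + 1` output vectors are touched per sample, instead of one for every word in the vocabulary.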
Skip Gram:
In CBOW we update embeddings by trying to predict a target word given its surrounding words. Skip Gram is the exact opposite: we try to predict the context, or surrounding words, given a word. For example, in the same sentence we used for CBOW with a window size of 2, the input word ‘preserve’ is paired with each of its neighbors as an output: (‘preserve’, ‘word’), (‘preserve’, ‘embeddings’), (‘preserve’, ‘semantic’), (‘preserve’, ‘context’).
The neural network setup is the same as that of CBOW, with an input layer, a hidden (embedding) layer, and a softmax output, except that there is a single input word, so nothing needs to be summed or averaged. So I am not going into much detail, as the concept is clear here.
Skip Gram works well with a small amount of data and is found to represent rare words well. On the other hand, CBOW is faster to train and has better representations for more frequent words.
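The Skip Gram pairing can be sketched as follows. The helper `skipgram_pairs` is my own name, and I am assuming lowercased, whitespace-split tokens, with windows simply truncated at the sentence edges.

```python
def skipgram_pairs(tokens, window=2):
    """Return (input_word, output_word) pairs for every word and each of its neighbors."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


sentence = "Word embeddings preserve semantic context of the words"
tokens = sentence.lower().split()

pairs = skipgram_pairs(tokens, window=2)
preserve_pairs = [pair for pair in pairs if pair[0] == "preserve"]
for pair in preserve_pairs:
    print(pair)
```

Compared with CBOW, one position in the sentence now produces several training samples, one per neighbor, which is part of why Skip Gram takes longer to train.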
In this manner the word2vec method obtains embeddings for words that retain semantics. One hyperparameter is the embedding dimension, which is our choice; it can be anything from 2 to 500 or even more. For example, with an embedding dimension of 100, the word ‘King’ will be expressed as a dense vector of shape 1x100. Though this is seen as a breakthrough in NLP, there are certain limitations in this architecture:
- Out of Vocabulary: We train word2vec on the training data we have, and it learns embeddings for the vocabulary in that data. If the test data contains a new word that is not in this vocabulary, the model cannot assign it an embedding.
- Polysemy: ‘Bank’ in ‘river bank’ and ‘state bank’ will have the same embedding, though the meanings are completely different.
- Some anomalies in the embeddings: antonyms such as “happy” and “sad” are usually located very close to each other in the vector space because they appear in similar contexts, which may limit the performance of word vectors in NLP tasks like sentiment analysis.
- Local context based methods like Word2Vec are known to fail at capturing the global statistics or structure of the corpus.
GloVe (GLObal VEctors)
There are a few more approaches that address some of these issues. One of them is GloVe. To understand GloVe, let’s first understand co-occurrence based models. The main intuition behind them is that a strong association between words can be understood by analyzing how often they occur together across all documents in the available corpus. The co-occurrence of words can reveal much about their semantic proximity and meaning.
We will need a measure to quantify the co-occurrence of 2 words “W1” and “W2”. Pointwise Mutual Information (PMI) is a very popular co-occurrence measure.
PMI: $\mathrm{PMI}(w_1, w_2) = \log \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)}$, where p(w) is the probability of the word occurring and p(w1,w2) is the joint probability of the two words occurring together. A high PMI indicates a strong association between the words.
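A small sketch of PMI on a toy corpus. The four "documents" are made up for illustration, probabilities are estimated as the fraction of documents containing the word(s), and the base-2 logarithm is my own choice (any base works, it only rescales the values).

```python
import math

# Toy corpus, invented for illustration only
documents = [
    "ice is solid",
    "steam is gas",
    "ice is cold water",
    "steam is hot water",
]
doc_tokens = [set(doc.split()) for doc in documents]


def probability(*words):
    """Fraction of documents that contain all of the given words."""
    hits = sum(1 for tokens in doc_tokens if all(w in tokens for w in words))
    return hits / len(doc_tokens)


def pmi(w1, w2):
    """Pointwise mutual information; -inf when the words never co-occur."""
    joint = probability(w1, w2)
    if joint == 0:
        return float("-inf")
    return math.log2(joint / (probability(w1) * probability(w2)))


print(pmi("ice", "solid"))
print(pmi("ice", "water"))
print(pmi("ice", "gas"))
```

A positive value means the pair appears together more often than chance would predict, zero means the two words look independent, and pairs that never co-occur go to negative infinity.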
Co-occurrence representations are usually very high dimensional and require a lot of storage. We can use dimensionality reduction techniques to handle the high dimensional data. But due to the huge storage requirements, these models have not been able to replace the static Word2Vec embeddings.
For example, in the sentence “Word embeddings preserve semantic context of the words, word embeddings are awesome”, the pair [word, embeddings] occurs 2 times, so we fill 2 in the corresponding cell of the co-occurrence matrix, and so on for every other pair.
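Counting these co-occurrences can be sketched with a hash map of counters. I am assuming a symmetric window of 1 (immediate neighbors only) and a simple lowercase, letters-only tokeniser; the function name is my own.

```python
import re
from collections import Counter, defaultdict


def cooccurrence_counts(text, window=1):
    """Count how often each word appears within `window` positions of another word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[word][tokens[j]] += 1
    return counts


text = "Word embeddings preserve semantic context of the words, word embeddings are awesome"
cooc = cooccurrence_counts(text, window=1)

print(cooc["word"]["embeddings"])
print(dict(cooc["embeddings"]))
```

Storing only the non-zero cells in a dictionary like this is much cheaper than a full vocabulary-by-vocabulary matrix, but it still grows quickly on a real corpus.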
GloVe Training
Unlike Word2vec, GloVe does not rely only on local statistics (the local context of words) but also uses global statistics (word co-occurrence counts) to obtain the word vectors. GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, as seen above. The resulting representations showcase interesting linear substructures of the word vector space.
Pennington et al. (2014) present a simple example based on the words ice and steam to illustrate this. Let P(k|w) be the probability that the word k appears in the context of the word w. Ice co-occurs more frequently with solid than it does with gas, whereas steam co-occurs more frequently with gas than it does with solid. Both words co-occur frequently with water (as it is their shared property) and infrequently with the unrelated word fashion.
In other words, P(solid | ice) will be relatively high, and P(solid | steam) will be relatively low. Therefore, the ratio P(solid | ice) / P(solid | steam) will be large. If we take a word such as gas that is related to steam but not to ice, the ratio P(gas | ice) / P(gas | steam) will instead be small. For a word related to both ice and steam, such as water, we expect the ratio to be close to one.
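The ratio argument can be checked with a few lines of Python. The counts below are toy numbers that I invented purely for illustration; they are not the figures reported by Pennington et al., and the function names are hypothetical.

```python
# Toy context counts, invented for illustration only (not from the GloVe paper)
context_counts = {
    "ice":   {"solid": 80, "gas": 5, "water": 300, "other": 615},
    "steam": {"solid": 5, "gas": 90, "water": 300, "other": 605},
}


def conditional_probability(k, w):
    """P(k | w): share of w's context occurrences that are the word k."""
    counts = context_counts[w]
    return counts[k] / sum(counts.values())


def probability_ratio(k):
    """P(k | ice) / P(k | steam)."""
    return conditional_probability(k, "ice") / conditional_probability(k, "steam")


for probe in ["solid", "gas", "water"]:
    print(probe, probability_ratio(probe))
```

The ratio is much larger than one for ‘solid’, much smaller than one for ‘gas’, and one for ‘water’, which is exactly the signal GloVe tries to encode in the word vectors.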
In the co-occurrence matrix X, a cell Xij represents how often the word Wi appears in the context of the word Wj in the corpus, that is, the number of times Wi and Wj co-occur.
We need to build word vectors that represent the co-occurrence of words for each pair i and j. Unlike Word2Vec, GloVe does not predict surrounding words one window at a time; it fits the vectors directly to the co-occurrence counts. Once X is ready, we choose vector values in continuous space for each word in the corpus so that, for every pair of words i and j, the following soft constraint holds as closely as possible: $W_i^{T} W_j + b_i + b_j = \log X_{ij}$. Here Wi is the vector of the main word, Wj is the vector of the context word, and bi and bj are scalar bias terms.
We’ll do this by minimizing an objective function J, which evaluates the sum of all squared errors based on the above equation, weighted with a function f: $J = \sum_{i,j=1}^{V} f(X_{ij}) \left( W_i^{T} W_j + b_i + b_j - \log X_{ij} \right)^2$
Here V is the size of the vocabulary. The squared term is the cost of violating the soft constraint for the word pair i and j: it compares the sum of Wi·Wj, bi and bj with the log of the co-occurrence count Xij.
Some co-occurrences happen rarely and are noisy or carry little information. To discount them, a weighted least squares regression model is used, where the weighting function f grows with the count up to a cutoff and is capped at 1 beyond it: $f(x) = (x / x_{\max})^{\alpha}$ if $x < x_{\max}$, and $f(x) = 1$ otherwise. Pennington et al. (2014) use $x_{\max} = 100$ and $\alpha = 3/4$.
There are a few advantages and disadvantages of using GloVe.
It is fast to train and scales to very large corpora. We can also stop training early, as soon as the improvements become negligible. But at the same time it uses a lot of memory: the fastest way to construct a term co-occurrence matrix is to keep it in RAM as a hash map and perform the co-occurrence increments globally. GloVe training is also sometimes quite sensitive to the initial learning rate.
I hope this blog provides you with enough information on two important methods for creating word embedding vectors. In upcoming blogs we will learn about further developments beyond W2V and GloVe that are used in NLP.
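Here is a minimal sketch of the weighted least squares objective in plain Python. I am assuming the weighting function from Pennington et al. (2014), with a cutoff of 100 and an exponent of 0.75, separate vectors and biases for main and context words, and a tiny made-up co-occurrence dictionary. All names are my own, and this only evaluates the loss; it does not train anything.

```python
import math
import random


def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f: discounts rare pairs and is capped at 1 for frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_loss(cooc, W, W_ctx, b, b_ctx):
    """Sum of f(Xij) * (Wi . Wj + bi + bj - log Xij)^2 over pairs with Xij > 0."""
    total = 0.0
    for (i, j), x in cooc.items():
        if x <= 0:
            continue  # log(0) is undefined, so zero counts are skipped
        prediction = sum(a * c for a, c in zip(W[i], W_ctx[j])) + b[i] + b_ctx[j]
        total += weight(x) * (prediction - math.log(x)) ** 2
    return total


random.seed(0)
vocab_size, embed_dim = 3, 2

# Toy co-occurrence counts keyed by (main word index, context word index)
cooc = {(0, 1): 2, (1, 0): 2, (1, 2): 1, (2, 1): 1, (0, 2): 150}

W = [[random.uniform(-0.5, 0.5) for _ in range(embed_dim)] for _ in range(vocab_size)]
W_ctx = [[random.uniform(-0.5, 0.5) for _ in range(embed_dim)] for _ in range(vocab_size)]
b = [0.0] * vocab_size
b_ctx = [0.0] * vocab_size

loss = glove_loss(cooc, W, W_ctx, b, b_ctx)
print(loss)
print(weight(1), weight(50), weight(150))
```

Training would repeatedly nudge the vectors and biases to reduce this loss; because only the non-zero cells of X are visited, the cost of a pass depends on the number of observed pairs rather than on the square of the vocabulary size.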
Check my previous blog on BagofWords/TFIDF and next blog on Word2Vec implementation
Thank You! :)