All you need to know about Word2vec

Why is word embedding needed?

The purpose is to create a representation of words that captures their meaning, their semantic relationships, and the different types of contexts they are used in.

All of this is achieved using Word Embeddings: numerical representations of text that computers can handle.

What is a word embedding?

In very simple terms, Word Embeddings are texts converted into numbers, and there may be different numerical representations of the same text. But before we dive into the details of Word Embeddings, the following question should be asked: why do we need Word Embeddings?

As it turns out, many Machine Learning algorithms and almost all Deep Learning architectures are incapable of processing strings or plain text in their raw form. Broadly speaking, they require numbers as inputs to perform any sort of job, be it classification, regression, etc. And with the huge amount of data present in text format, it is imperative to extract knowledge from it and build applications. Some real-world applications of text processing are sentiment analysis of reviews by Amazon, and document or news classification and clustering by Google.

Let us now define Word Embeddings formally. A Word Embedding generally tries to map a word to a vector using a dictionary. Let us break this sentence down into finer details to get a clearer view.

Take a look at this example: sentence = "Word Embeddings are Word converted into numbers"

A word in this sentence may be "Embeddings" or "numbers", etc.

A dictionary may be the list of all unique words in the sentence. So, here the dictionary may look like: ['Word', 'Embeddings', 'are', 'converted', 'into', 'numbers']

A vector representation of a word may be a one-hot encoded vector, where 1 stands for the position where the word exists and 0 everywhere else. In this format, according to the above dictionary, the vector representation of "numbers" is [0,0,0,0,0,1] and that of "converted" is [0,0,0,1,0,0].
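As a quick illustration, here is a minimal sketch of this one-hot scheme in plain Python (the helper name one_hot is mine, purely for illustration):

```python
# Minimal sketch of the one-hot representation described above.
sentence = "Word Embeddings are Word converted into numbers"

# Dictionary: the unique words of the sentence, in order of first appearance.
dictionary = []
for word in sentence.split():
    if word not in dictionary:
        dictionary.append(word)
# dictionary -> ['Word', 'Embeddings', 'are', 'converted', 'into', 'numbers']

def one_hot(word):
    # 1 at the word's position in the dictionary, 0 everywhere else.
    return [1 if w == word else 0 for w in dictionary]

print(one_hot("numbers"))    # [0, 0, 0, 0, 0, 1]
print(one_hot("converted"))  # [0, 0, 0, 1, 0, 0]
```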

This is just a very simple method to represent a word in the vector form. Let us look at different types of Word Embeddings or Word Vectors and their advantages and disadvantages over the rest.

Different types of Word Embeddings

The different types of word embeddings can be broadly classified into two categories:

  1. Frequency based Embedding
  2. Prediction based Embedding

Let us try to understand each of these methods in detail.

Frequency based Embedding

There are generally three types of vectors that we encounter under this category.

  1. Count Vector
  2. TF-IDF Vector
  3. Co-Occurrence Vector

Let us look into each of these vectorization methods in detail.
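The first two are easiest to see in code. Here is a small sketch using scikit-learn (an assumption on my part; any implementation of term counting and TF-IDF weighting would do):

```python
# Count vectors and TF-IDF vectors for a tiny two-document corpus.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "Word Embeddings are Word converted into numbers",
    "numbers are converted into vectors",
]

count_matrix = CountVectorizer().fit_transform(corpus)   # raw term counts per document
tfidf_matrix = TfidfVectorizer().fit_transform(corpus)   # counts re-weighted by inverse document frequency

print(count_matrix.toarray())
print(tfidf_matrix.toarray())
```

A co-occurrence vector is built in a similar spirit, but from a word-by-word matrix that counts how often pairs of words appear together within a context window.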

To get a better semantic understanding of words, word2vec was introduced to the NLP community.

First, I will show you how to use the pretrained word2vec model released by Google, trained on the Google News dataset. Though it seems easy, the load time of the model, due to its humongous size, is something to ponder upon.

Simple model loading using gensim, and doing some magic. Take a look at the execution times mentioned in the script comments.
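The script itself is not reproduced here, so below is a minimal sketch of what it does, assuming gensim and a local copy of GoogleNews-vectors-negative300.bin.gz; the three similarity queries are stand-ins for the "magic" statements being timed:

```python
# Sketch: load the full pretrained Google News model, then time three queries.
import time
from gensim.models import KeyedVectors

start = time.time()
model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True)
print('load time: %.1f sec' % (time.time() - start))   # ~55 sec in the runs quoted below

start = time.time()
print(model.most_similar('king', topn=5))                                 # nearest neighbours
print(model.most_similar(positive=['king', 'woman'], negative=['man']))   # king - man + woman
print(model.similarity('woman', 'man'))                                   # cosine similarity
print('query time: %.1f sec' % (time.time() - start))  # ~15 sec in the runs quoted below
```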

As can be seen from the stats mentioned, it took around 55 sec to load the model and around 15 sec to compute the next three statements, every time I had to run this model, which is ugly. Moreover, you can easily see the sharp rise in memory usage.

OK, so far so good. Now let's try to make model loading faster. But how?

There is a parameter in gensim named limit, which sets the maximum number of word vectors to read from the file. The default, None, means read them all. This brings the output we were seeking earlier down from around 70 sec (55 + 15) to around 12 sec (8.43 + 3.29). I accept this is indeed not the best way to deal with the problem, but it is surely the simplest.
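A sketch of the same load with limit set; the value 500,000 here is an assumption for illustration, since the exact cut-off used is not stated above:

```python
from gensim.models import KeyedVectors

# Read only the first 500k (most frequent) word vectors instead of all ~3 million.
model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz',
    binary=True,
    limit=500000)
```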

But if you're re-loading the model for every web request, you'll still be hurt by the IO-bound loading speed and the redundant memory overhead of storing each re-load.
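One common way around that, sketched below under the assumption that the serving process can keep the vectors in memory for its whole lifetime (Flask is used purely as an example framework), is to load the model once at start-up and reuse it for every request:

```python
from flask import Flask, jsonify
from gensim.models import KeyedVectors

app = Flask(__name__)

# Loaded exactly once, when the web process starts, not per request.
model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True, limit=500000)

@app.route('/similar/<word>')
def similar(word):
    # Each request reuses the already-loaded vectors; no repeated IO or memory cost.
    neighbours = [(w, float(score)) for w, score in model.most_similar(word, topn=10)]
    return jsonify(neighbours)
```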

Revisit this, this & this to research more on this.
