Feature Extraction from texts using Word2vec

To get a better semantic understanding of a word, word2vec was published for nlp community.

First, I will show you how to use pretrained word2vec model by Google trained on Google News dataset. Though it seems easy but the statistics behind the load time of model due to it’s humongous size is something to ponder upon.

Simple model loading using genism and doing some magic. Take a look at execution time mentioned in script comments.

As can be seen from stats mentioned, it took around 55 sec in loading the model & around 15 sec to compute next three statements, everytime I had to run this model, which is Ugly. Moreover you can easily check the sharp rise in your memory usage.

Ok, so far so good. Now let’s try to make model loading faster. But how ?

There is a param in genism named limit which sets a maximum number of word-vectors to read from the file. The default None, means read all. This allows us to get the output we were earlier seeking from around 70(55 + 15) sec to 10 sec (8.43 +3.29). I accept this is indeed not the best way to get away with these, but yes simplest for sure.

But if you’re re-loading for every web-request, you’ll still be hurting from loading’s IO-bound speed, and the redundant memory overhead of storing each re-load.

Revisit this, this & this to research more on this.