Synonyms using GloVe

If you have a set of pretrained word embeddings, such as GloVe from Stanford, you can build a synonym generator quite easily.
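As a starting point, here is a minimal sketch of reading the vectors into memory. It assumes the plain-text GloVe layout of one word per line followed by its space-separated vector components. The function name `load_glove` and the tiny in-memory sample are made up for illustration.

```python
from typing import Dict, Iterable, List


def load_glove(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse lines of the form 'word v1 v2 ... vn' into a word -> vector dict."""
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Tiny in-memory stand-in for a real embeddings file (hypothetical values).
sample = ["cat 0.1 0.2 0.3", "dog 0.1 0.25 0.28"]
embeddings = load_glove(sample)

# With a real file you would pass the open file object instead, e.g.
#   with open(path_to_glove_file, encoding="utf-8") as f:
#       embeddings = load_glove(f)
```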

First, use Locality Sensitive Hashing (LSH) to split the vector space of all the words in your vocabulary into a number of partitions. This is as simple as determining the range of each embedding dimension and picking a uniform random point between its minimum and maximum. A set of these random points, one for each embedding dimension, defines a hyperplane (strictly speaking, its normal vector). Multiply a word's vector element-wise by this vector and sum the results, which is just the dot product. The sign of the result (-1, +1 or 0) gives you the key part for that hyperplane. Repeat for each of the other hyperplanes and form a tuple of these key parts. Use these tuples as the keys of a dict whose values are lists of the words that hash to each key, and you are good to go.
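The hashing step can be sketched roughly as follows. This is an illustration of the idea using only the standard library, not a tuned implementation. The names `make_hyperplanes`, `hash_key` and `build_index`, the toy vectors and the choice of four hyperplanes are all assumptions made for the example.

```python
import random
from collections import defaultdict


def make_hyperplanes(vectors, n_planes, seed=0):
    """Draw one random point per dimension, within that dimension's range, per plane."""
    rng = random.Random(seed)
    dims = len(next(iter(vectors.values())))
    lows = [min(v[d] for v in vectors.values()) for d in range(dims)]
    highs = [max(v[d] for v in vectors.values()) for d in range(dims)]
    return [[rng.uniform(lows[d], highs[d]) for d in range(dims)]
            for _ in range(n_planes)]


def sign(x):
    """Return -1, 0 or +1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def hash_key(vec, planes):
    """One key part per hyperplane: the sign of the dot product with it."""
    return tuple(sign(sum(a * b for a, b in zip(vec, p))) for p in planes)


def build_index(vectors, planes):
    """Map each hash key to the list of words that share it."""
    buckets = defaultdict(list)
    for word, vec in vectors.items():
        buckets[hash_key(vec, planes)].append(word)
    return buckets


# Toy 3-dimensional "embeddings" (hypothetical values).
vectors = {
    "happy": [0.9, 0.1, -0.2],
    "glad": [0.8, 0.2, -0.1],
    "table": [-0.5, 0.1, 0.9],
    "chair": [-0.4, 0.0, 0.8],
}
planes = make_hyperplanes(vectors, n_planes=4)
buckets = build_index(vectors, planes)
```

The number of hyperplanes is the main knob: more hyperplanes mean smaller buckets and faster lookups, but a higher chance that a genuinely similar word ends up in a different bucket. A common variant draws each hyperplane component from a zero-centred normal distribution instead of from the per-dimension range.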

When you want to find synonyms for a word, look up the bucket of words that share its hash key and calculate the cosine similarity between the word and each of them. Sort by these similarities, highest first, and you have your synonyms.
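Ranking the words in a bucket might look something like this. The sketch assumes the bucket has already been looked up by hash key, so `bucket` here is just a hand-written list of candidate words. The names `cosine` and `rank_bucket` and the toy vectors are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_bucket(word, bucket, vectors, top_n=10):
    """Score every other word in the bucket against `word`, best match first."""
    query = vectors[word]
    scored = [(other, cosine(query, vectors[other]))
              for other in bucket if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors (hypothetical values) and a bucket assumed to share one hash key.
vectors = {
    "happy": [0.9, 0.1, 0.0],
    "glad": [0.8, 0.2, 0.1],
    "joyful": [0.7, 0.3, 0.0],
    "table": [0.0, 0.1, 0.9],
}
bucket = ["happy", "glad", "joyful"]
ranked = rank_bucket("happy", bucket, vectors)
```

Only the words in the bucket are scored, which is where the speed-up over scanning the whole vocabulary comes from.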

Note that this is an approximate nearest neighbours search. If, as a (slow) experiment, you calculate the cosine similarity against all of your word vectors, you will get a fuller list of similar words; some of them are missed by the hashed lookup because they landed in a different bucket.
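One way to run that experiment is to compute the exact neighbours by brute force and measure how many of them the approximate lookup recovered. The sketch below normalises each vector to unit length first, so that cosine similarity reduces to a plain dot product. The names `unit`, `brute_force_neighbours` and `recall`, the toy vectors and the pretend approximate result are all assumptions for illustration.

```python
import heapq
import math


def unit(vec):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def brute_force_neighbours(word, vectors, top_n=10):
    """Exact search: compare `word` against every other word in the vocabulary."""
    units = {w: unit(v) for w, v in vectors.items()}
    query = units[word]
    scored = ((sum(a * b for a, b in zip(query, u)), w)
              for w, u in units.items() if w != word)
    return [w for _, w in heapq.nlargest(top_n, scored)]


def recall(approx, exact):
    """Fraction of the exact neighbours that the approximate search also found."""
    if not exact:
        return 1.0
    return len(set(approx) & set(exact)) / len(exact)


# Toy vectors (hypothetical values).
vectors = {
    "happy": [0.9, 0.1, 0.0],
    "glad": [0.8, 0.2, 0.1],
    "joyful": [0.7, 0.3, 0.0],
    "table": [0.0, 0.1, 0.9],
}
exact = brute_force_neighbours("happy", vectors, top_n=2)
approx = ["glad"]  # pretend "joyful" hashed into a different bucket
score = recall(approx, exact)
```

A low recall suggests using fewer hyperplanes, or building several independent indexes and merging their buckets before ranking.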

Initially I tried this out with a GloVe vector set with a 400k-word vocabulary. I will try a much larger set (~1.2 million words) to see if I get more refined results.
