Creating our own word embeddings with GloVe
The Stanford GloVe website gives details on how to train your own set of embeddings, and this is fairly straightforward with the tools they provide.
I am interested in seeing whether I can train embeddings for words in our book that are missing even from the large-corpus pre-trained GloVe models. Bleak House is a Dickens novel, so its vocabulary is old-fashioned in many ways.
Just tokenizing the original text of Bleak House yields 46 words that are not covered by GloVe's original 1.9M-word vocabulary. Extending the training corpus to a much larger range of Dickens novels, and maybe Victorian novels in general, would probably produce a much better result.
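The out-of-vocabulary check described above can be sketched in a few lines of Python. This is a minimal illustration, not the exact tokenizer used for the count of 46: the regex, the sample sentence, and the tiny stand-in vocabulary are all assumptions, and a real run would load the full novel and GloVe's vocabulary file instead.

```python
import re

def tokenize(text):
    # Lowercase and pull out runs of letters, keeping internal
    # apostrophes and hyphens so Dickensian tokens like "guv'ner"
    # and "tom-all-alone" survive as single words.
    return re.findall(r"[a-z']+(?:-[a-z']+)*", text.lower())

def oov_words(text, known_vocab):
    """Return the sorted set of tokens absent from known_vocab."""
    return sorted(set(tokenize(text)) - known_vocab)

# Toy illustration: a one-sentence "novel" and a four-word "GloVe
# vocabulary". The real inputs would be bleak_house.txt and the
# 1.9M-entry vocab that ships with the pre-trained vectors.
sample = "Mr. Guppy's guv'ner went for'ard to the growlery."
glove_vocab = {"mr", "went", "to", "the"}
print(oov_words(sample, glove_vocab))
```

The same function, pointed at the full text and the real vocabulary file, is what produces the list of leftover words below.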
The new words are a strange lot, but maybe that is all the more reason to be able to offer synonyms for them.
"'eart",
"'my",
"'now",
"'ouse",
"'prentices",
"'t",
"'this",
"'what",
"'you",
'a-moving',
'ap-kerrig',
'berryin',
'bogsby',
'borrioboola-gha',
'borrioboolan',
'boythorn',
'chadbands',
'coavinses',
'coodle',
'dedlocks',
"for'ard",
'gownd',
'growlery',
'grubble',
"guppy's",
"guv'ner",
'inkwhich',
'law-stationer',
'law-writer',
'pardiggle',
'sangsby',
'shop-door',
'squod',
'swosser',
'terewth',
'thavies',
'tom-all-alone',
'toughey',
'turveydrop',
'unfortnet',
've-ry',
'weevle',
'wiglomeration',
'wiolinceller',
"won't",
"wouldn't"
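Once embeddings exist for these words, the usual way to surface synonym candidates is nearest-neighbour lookup by cosine similarity. Below is a self-contained sketch of that step; the tiny hand-made vectors and the word choices are invented purely for illustration, since the real input would be the vector file that GloVe training writes out (one token per line, followed by its floats).

```python
import math

def cosine(u, v):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, vectors, k=2):
    """Rank the other vocabulary words by similarity to `word`."""
    target = vectors[word]
    ranked = sorted(
        ((cosine(target, vec), w) for w, vec in vectors.items() if w != word),
        reverse=True,
    )
    return [w for _, w in ranked[:k]]

# Hypothetical 3-dimensional vectors; real GloVe vectors would be
# 50-300 dimensions and loaded from the trained output file.
toy = {
    "growlery": [0.9, 0.1, 0.0],
    "study":    [0.8, 0.2, 0.1],
    "den":      [0.7, 0.3, 0.0],
    "violin":   [0.0, 0.1, 0.9],
}
print(nearest("growlery", toy))  # room-like words should rank first
```

With vectors trained on enough Dickens text, the hope is that a lookup like this would place "growlery" near ordinary words such as "study", giving readers a plausible gloss for the invented vocabulary.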