Simple POS tagger with 92% accuracy
My goal here is to build a POS tagger using Keras. My first attempts, based on things like bigrams, performed very poorly. Next I tried sequence models but could not get them to work. They may not be an inherently bad idea for this task, but everything about the code felt like overkill, and since I had yet to get it working, that convinced me to drop the idea.
My most recent effort is based on using the features described here for building a tagger.
The encoding of the tags and words had a few gotchas. Although the input features are a mixture of words, suffixes, prefixes and tags drawn from the surrounding context, the target is just a POS tag. Leaving the one-hot encoding of the target at the overall vocabulary size - about 47k tokens - would have given very poor results: there are fewer than 50 tags, so it would be needle-in-a-haystack stuff. So two vocabularies were needed.
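The two-vocabulary idea can be sketched with toy data (all names and the tiny word/tag lists here are hypothetical, standing in for the real ~1MM-word training set):

```python
# Minimal sketch: the input vocabulary covers words, prefixes, suffixes
# and tags, while the target vocabulary covers only the POS tags, so the
# one-hot target vector stays small (~45 entries instead of ~47k).

def build_vocab(tokens):
    """Map each distinct token to an integer id."""
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}

# Toy data standing in for the real training set.
words = ["the", "cat", "sat"]
tags = ["DT", "NN", "VBD"]
fragments = [w[:2] for w in words] + [w[-2:] for w in words]  # prefixes/suffixes

input_vocab = build_vocab(words + fragments + tags)  # everything a feature can be
tag_vocab = build_vocab(tags)                        # targets only

def one_hot_tag(tag):
    """One-hot encode a tag against the small target vocabulary."""
    vec = [0] * len(tag_vocab)
    vec[tag_vocab[tag]] = 1
    return vec
```

The point is simply that `one_hot_tag` produces a vector sized by the tag vocabulary, not the full input vocabulary.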
I needed the input vocabulary to cover the full range of input features, so this meant adding tokens for the various word fragments too. Not a problem, as the training set is very large: ~1MM words.
With the features and target appropriately encoded, I could build the network:
An Embedding layer whose input dimension is the sum of the two vocabularies (word fragments and POS tags).
A Flatten layer so that the output of the Embedding matches the final Dense layer. That Dense layer's dimension is the size of the POS vocabulary - about 45 in this case.
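The architecture described above could look roughly like this in Keras. The vocabulary sizes, embedding width, and feature count are assumptions for illustration, not the project's actual hyperparameters:

```python
import numpy as np
from tensorflow import keras

# Hypothetical sizes: ~47k word/fragment tokens plus ~45 tags in the
# input vocabulary; ~45 tags in the target vocabulary.
INPUT_VOCAB = 47000 + 45
N_TAGS = 45
N_FEATURES = 18  # context words, suffixes, prefixes, neighbouring tags, etc.

model = keras.Sequential([
    keras.layers.Input(shape=(N_FEATURES,)),
    # Embedding input dimension = sum of the two vocabularies.
    keras.layers.Embedding(INPUT_VOCAB, 32),
    # Flatten (batch, N_FEATURES, 32) -> (batch, N_FEATURES * 32)
    # so it feeds the final Dense layer.
    keras.layers.Flatten(),
    # Output dimension = size of the POS vocabulary.
    keras.layers.Dense(N_TAGS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Each training example is then a fixed-length vector of integer feature ids, and the model predicts a distribution over the ~45 tags.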
Training on just a fraction of the total data gave accuracy on the val set of about 92%. This is sufficient for my needs, so I won't take the POS tagger any further for now.
Next up, I am going to convert the model to TFLite so that I can consume it from some JavaScript.
You can find the full code of the project here.
As a rough check that everything was as I expected, I ran prediction on the first 1000 examples in the validation set and checked them against ground truth, printing the cumulative accuracy after every 100 examples.
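That check can be sketched as follows, assuming `y_pred` and `y_true` are plain lists of predicted and ground-truth tag ids (hypothetical names; in the real project `y_pred` would come from something like `np.argmax(model.predict(...), axis=1)`):

```python
def running_accuracy(y_pred, y_true, step=100):
    """Print cumulative accuracy after every `step` examples,
    then return the overall accuracy."""
    correct = 0
    for i, (p, t) in enumerate(zip(y_pred, y_true), start=1):
        correct += int(p == t)
        if i % step == 0:
            print(f"{i}: {correct / i:.3f}")
    return correct / len(y_true)
```

Running it over the first 1000 validation examples reproduces the accuracy-at-every-100 printout described above.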