Simple POS tagger with 92% accuracy

My goal here is to build a POS tagger using Keras. My first attempts, using things like bigrams, were very poor. Next I tried sequence models, but could not get them to work. They may not be inherently a bad idea for this, but everything about the code felt like overkill, and since I had yet to get one working, I dropped that idea.

My most recent effort is based on using the features described here for building a tagger. 

The encoding of the tags and words had a few gotchas. Although the input features consist of a mixture of words, suffixes, prefixes and tags from other parts of the context, the target is just a POS tag. Leaving the one-hot encoding of the target at the overall vocab size (about 47k tokens) would have given very poor results, because there are fewer than 50 tags. Needle in a haystack stuff. So 2 vocabularies were needed.
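To make the two-vocabulary idea concrete, here is a minimal sketch on a toy corpus. The helper names (`build_vocab`, `encode_input_tag`) and the offset trick for sharing one index space are my own assumptions for illustration, not necessarily how the project code does it.

```python
def build_vocab(tokens, reserved=("<pad>", "<unk>")):
    """Map each distinct token to an integer id, reserved tokens first."""
    vocab = {tok: i for i, tok in enumerate(reserved)}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

# Toy tagged corpus (hypothetical data, just for illustration).
tagged = [("the", "DT"), ("dog", "NN"), ("barks", "VBZ"),
          ("the", "DT"), ("cats", "NNS")]

# Vocabulary 1: word bits used as input features.
word_vocab = build_vocab(w for w, _ in tagged)
# Vocabulary 2: POS tags only. This is what the target is encoded with.
tag_vocab = build_vocab((t for _, t in tagged), reserved=())

def encode_input_tag(tag):
    """Tags used as *input* features are shifted past the word ids
    so that both vocabularies can share one embedding table."""
    return len(word_vocab) + tag_vocab[tag]

# The embedding has to cover both vocabularies...
embedding_input_dim = len(word_vocab) + len(tag_vocab)
# ...but the target only ranges over the small tag vocabulary.
targets = [tag_vocab[t] for _, t in tagged]

print(embedding_input_dim, len(tag_vocab), targets)
```

The point is that the target ids stay in a small range (the number of tags), while the input ids can be as large as the combined vocabulary.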

I needed to cover the full range of input features in the vocab, so this involved adding tokens for the various word fragments in there too. This was not a problem as the training set is very large, about 1 million words.
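As a rough illustration of what those feature tokens might look like, here is a small sketch. The exact feature template (3-character suffix, 1-character prefix, neighbouring words, previous tag) and the function name `extract_features` are assumptions on my part; the real feature set is the one from the linked description.

```python
def extract_features(words, prev_tags, i):
    """Return a fixed-length list of feature tokens for the word at position i.

    words     -- the sentence as a list of strings
    prev_tags -- tags already assigned to words[0..i-1]
    """
    word = words[i].lower()
    return [
        "w=" + word,                     # the word itself
        "suf3=" + word[-3:],             # suffix fragment
        "pre1=" + word[:1],              # prefix fragment
        "w-1=" + (words[i - 1].lower() if i > 0 else "<s>"),
        "w+1=" + (words[i + 1].lower() if i + 1 < len(words) else "</s>"),
        "t-1=" + (prev_tags[i - 1] if i > 0 else "<s>"),  # previous tag
    ]

sentence = ["The", "dog", "barks"]
features = extract_features(sentence, ["DT", "NN"], 2)
print(features)
```

Every one of these strings needs its own id in the vocab, which is why the word fragments have to be added as tokens. Keeping the list the same length for every word also means each example has the same input shape.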

With the features and target appropriately encoded, I could build the network.

First, an Embedding layer whose input dimension is the sum of the 2 vocabularies (word bits and POS tags).

Then a Flatten layer, so that the output of the Embedding matches the final Dense layer. That Dense layer has an output dimension equal to the size of the POS vocabulary, about 45 in this case.
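A network along those lines could be sketched in Keras as follows. The number of features per example, the embedding width, the softmax activation, the loss and the optimizer are all assumptions for illustration; only the layer order and the approximate vocabulary sizes come from the description above.

```python
from tensorflow import keras

INPUT_VOCAB_SIZE = 47_000  # roughly the combined vocab size mentioned above
NUM_TAGS = 45              # roughly the size of the POS vocabulary
NUM_FEATURES = 6           # assumption: feature tokens per example
EMBED_DIM = 32             # assumption: embedding width

model = keras.Sequential([
    keras.Input(shape=(NUM_FEATURES,)),
    # One embedding vector per feature token id.
    keras.layers.Embedding(input_dim=INPUT_VOCAB_SIZE, output_dim=EMBED_DIM),
    # (NUM_FEATURES, EMBED_DIM) -> one flat vector for the Dense layer.
    keras.layers.Flatten(),
    # One output per POS tag.
    keras.layers.Dense(NUM_TAGS, activation="softmax"),
])

# Assumption: integer-encoded targets, hence the sparse loss.
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()
```

With integer tag ids as targets, the sparse loss avoids building explicit one-hot vectors, though a one-hot target over the tag vocabulary with `categorical_crossentropy` would work just as well.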

Training on just a fraction of the total data gave an accuracy of about 92% on the validation set. This is sufficient for my needs, so I won't take the POS tagger any further for now.

Next up I am going to convert the model to TFLite so that I can consume it in some JavaScript.

You can find the full code of the project here.

As a rough check that everything was as I expected, I ran prediction on the first 1000 examples in the validation set and checked them against ground truth. The accuracy, printed out at every 100 examples, is shown below.
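A check like this can be sketched in a few lines. The function name `running_accuracy` and the toy data are hypothetical; in practice `predicted` would be something like the argmax of the model's predictions and `truth` the encoded validation tags.

```python
def running_accuracy(predicted, truth, report_every=100):
    """Return (examples_seen, accuracy_so_far) at every `report_every` examples."""
    correct = 0
    checkpoints = []
    for n, (p, t) in enumerate(zip(predicted, truth), start=1):
        correct += (p == t)
        if n % report_every == 0:
            checkpoints.append((n, correct / n))
    return checkpoints

# Toy stand-in data: 1000 tag ids, with every 10th prediction deliberately wrong.
truth = [i % 5 for i in range(1000)]
predicted = [t if i % 10 else -1 for i, t in enumerate(truth)]

results = running_accuracy(predicted, truth)
for n, acc in results:
    print(f"{n:5d} examples: accuracy {acc:.3f}")
```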


