Sequence models are not the right fit for POS taggers
So I was trying to train a sequence model to predict the POS tag for a word. The idea I was going with was that POS tags in a body of text follow some sort of sequence - much the same way as words do. This sequential nature of Natural Language is a cornerstone of NLP.
On reflection - and having tried unsuccessfully to train a sequence model - I have come to the conclusion that this approach does not work well for POS tagging.
Here are some of the things I tried:
I changed the labelled training data into a long list of (word, POS tag) tuples - stuff like ('run', 'VB'). I split this up into a windowed dataset, i.e. I created subsequences of n input tuples and 1 target tuple, with n being the sequence length, which I arbitrarily set to 20.
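Roughly like this - a minimal sketch, assuming the labelled data comes as (word, tag) tuples the way NLTK's tagged corpora provide them (the treebank corpus below is just a stand-in for whatever labelled data you have):

```python
# Minimal sketch of the windowing step.
# (Requires nltk.download('treebank') the first time.)
from nltk.corpus import treebank

# Flatten the tagged sentences into one long list of (word, tag) tuples.
pairs = [p for sent in treebank.tagged_sents() for p in sent]
# e.g. [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]

SEQ_LEN = 20  # the sequence length n, chosen arbitrarily

def windows(pairs, n=SEQ_LEN):
    """Yield (context, target): n consecutive (word, tag) tuples and the
    tuple that immediately follows them."""
    for i in range(len(pairs) - n):
        yield pairs[i:i + n], pairs[i + n]
```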
Then I created dictionaries for the various vocabularies - these were not word vocabularies but tuple vocabularies, i.e. an index number for each of the (word, POS) tuples in the training data. I translated the training data into integer lists in this way.
I also created a vocabulary for the POS tags themselves, so I could encode the targets, which of course were not words. Finally, I added the word whose tag we want to predict to the end of each input list.
So it all ends up looking something like this:
x = [(word1, POS1), (word2, POS2), ..., (wordn)]
y = the POS tag for wordn
with everything encoded as integers.
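In code, something like the following - continuing the sketch above, and glossing over exactly how the final bare word gets an index (here I've given it its own word vocabulary, which is an assumption for illustration rather than a recommendation):

```python
# Build the vocabularies: one index per (word, tag) tuple, one per POS tag,
# and (as an assumption) one per bare word for the final slot.
tuple_vocab = {pair: i for i, pair in enumerate(sorted(set(pairs)))}
tag_vocab = {tag: i for i, tag in enumerate(sorted({t for _, t in pairs}))}
word_vocab = {w: i for i, w in enumerate(sorted({w for w, _ in pairs}))}

def encode(context, target):
    """Turn one window into (x, y): n tuple ids plus the id of the word
    whose tag we want, and the integer id of that word's tag."""
    target_word, target_tag = target
    x = [tuple_vocab[p] for p in context] + [word_vocab[target_word]]
    y = tag_vocab[target_tag]
    return x, y

encoded = [encode(context, target) for context, target in windows(pairs)]
```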
This gave even worse results than the previous (very basic) probabilistic methods. There are probably significant improvements I could make to this approach to get better results, but overall I think this level of complexity should be unnecessary for a problem like POS tagging.
As I mentioned, the NLTK POS tagger gives excellent results. I decided to take a look at what they had done. That led me here. Excellent resource.
So my task for tomorrow is to build a simpler feed-forward neural network using the same features as the NLTK tagger and see how I get on.
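Something along these lines, I expect - this is a rough sketch of the shape of the model rather than the actual NLTK feature set; the feature count and layer sizes are placeholders I've picked for illustration:

```python
# A small feed-forward Keras model over a fixed-size vector of hand-crafted
# features (suffixes, surrounding words, capitalisation and the like).
import tensorflow as tf

NUM_FEATURES = 12   # assumed size of the hand-crafted feature vector
NUM_TAGS = 45       # roughly the size of the Penn Treebank tag set

model = tf.keras.Sequential([
    tf.keras.Input(shape=(NUM_FEATURES,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_TAGS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```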
Some learnings
Google Colaboratory provides free GPU instances for experimentation. For larger models you are likely to run out of memory, but Colab significantly outperforms my new MacBook Pro. Their TPUs are even faster, but can consume all of your free limit very quickly, so it's best to steer away from TPUs unless you are prepared to upgrade for continued service.
I learned quite a bit about TensorFlow Datasets on another Coursera course that I did last year. However, having not used it in a while, I found it difficult to get back into. A powerful technology, especially for larger datasets - where it manages memory very well - but a steep learning curve. Down which I appear to have slid :(
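For future me, the basic pattern looks roughly like this (the shapes assume the integer-encoded windows from the sketch above, and from_generator is one way to stream the data rather than holding it all in memory at once):

```python
import tensorflow as tf

SEQ_LEN = 20  # as above

def gen():
    # Stream the integer-encoded (x, y) windows from the earlier sketch.
    for x, y in encoded:
        yield x, y

ds = (tf.data.Dataset.from_generator(
          gen,
          output_signature=(
              tf.TensorSpec(shape=(SEQ_LEN + 1,), dtype=tf.int32),
              tf.TensorSpec(shape=(), dtype=tf.int32)))
        .shuffle(10_000)      # shuffle within a buffer
        .batch(128)           # batch for training
        .prefetch(tf.data.AUTOTUNE))  # overlap data prep with training
```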
Keras is a great library. There is still quite a bit that I don't understand about it, but it seems to be one that is here for the long haul. Well worth the time invested in learning it.