Playing with Part of Speech (POS) taggers

Breaking text up into parts of speech is useful for a variety of tasks. When a query is sent to a search engine, for example, the part of speech of each word in the query is a significant determinant of the results.

The overall aim of this project is to build a Named Entity Recognition system, but a useful step along the way is creating a POS tagger. Building one demonstrates some of the related techniques for working with language, and the generated part-of-speech tags will probably be useful features in the more advanced Named Entity Recognition model.

Yesterday I said I would identify the data to use for training and testing and set up the shell of a GitHub project.

Here is the project. The data I am using is from the Wall Street Journal; it was supplied as part of the Coursera course, but any POS-tagged dataset should be fine.

I tried three approaches.

First, find the most common tag for each word in the training set, and just use these (plus a tag for unknown words) to make predictions. A pretty basic method, but it gave about 59% accuracy, so a lot better than random guessing.
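A minimal sketch of this baseline, using a toy tagged corpus (the data, tag names, and the "NN" fallback for unknown words are illustrative assumptions, not the project's actual choices):

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sents):
    """For each word, count how often each tag appears in training,
    then keep only the most common tag per word."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_baseline(words, model, unknown="NN"):
    """Tag each word with its most common training tag,
    falling back to a fixed tag for unseen words."""
    return [(w, model.get(w, unknown)) for w in words]

# Toy training data (word, tag) pairs, purely for illustration
train = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
         [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]
model = train_baseline(train)
print(tag_baseline(["the", "dog", "flies"], model))
```

Accuracy is then just the fraction of predicted tags that match the gold tags on a held-out test set.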

Next I tried the same approach, but for bigrams. As expected, this gives a much better result of about 70%. Still not great, but not too bad.
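One way to sketch the bigram version: key the lookup on the (previous word, current word) pair, and back off to the unigram table when a bigram was never seen in training. This is my assumed reading of "the same approach, but for bigrams"; the sentence-start marker `"<s>"` and the `"NN"` fallback are illustrative choices:

```python
from collections import Counter, defaultdict

def train_bigram(tagged_sents):
    """Most common tag for each (previous word, word) pair,
    plus a unigram table for backoff."""
    bi, uni = defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sents:
        prev = "<s>"  # sentence-start marker
        for word, tag in sent:
            bi[(prev, word)][tag] += 1
            uni[word][tag] += 1
            prev = word
    most = lambda c: {k: v.most_common(1)[0][0] for k, v in c.items()}
    return most(bi), most(uni)

def tag_bigram(words, bi, uni, unknown="NN"):
    """Prefer the bigram tag, back off to unigram, then to a fixed tag."""
    tags, prev = [], "<s>"
    for w in words:
        tags.append((w, bi.get((prev, w)) or uni.get(w, unknown)))
        prev = w
    return tags
```

The bigram table disambiguates words like "duck" (noun after "the", verb after "to") that a unigram model must always tag the same way.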

Finally, I tried a state-of-the-art tagger that ships with the NLTK project. With minimal fuss this gives over 96% accuracy, leagues ahead of my two efforts.

So my next task is to build a sequence model with Keras to see if I can improve on my efforts here. The full code can be found in the repo above.
