POS tagger working for large body of text

March 04, 2021

I have fixed the POS tagger so that it can take a large body of text - a book in this case - and tag all of the words. I output the tagged corpus as a pandas dataframe and save it to CSV. I will use this in the next stage, which is to predict Named Entities from that corpus.
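As a rough illustration of that output step, here is a minimal sketch of tagging a piece of text and turning it into a two-column dataframe. The `tokenize` and `toy_tag` functions, the column names and the tag labels are all placeholders I have made up for the example; `toy_tag` stands in for the real trained model and is not how the actual tagger works.

```python
import re

import pandas as pd


def tokenize(text):
    """Split text into word and punctuation tokens (a deliberately simple rule)."""
    return re.findall(r"\w+|[^\w\s]", text)


def toy_tag(token):
    """Hypothetical stand-in for the trained POS model."""
    if not token[0].isalnum():
        return "PUNCT"
    if token[0].isupper():
        return "PROPN"
    return "X"


def tag_text(text):
    """Return one row per token with its (placeholder) POS tag."""
    tokens = tokenize(text)
    return pd.DataFrame({"word": tokens, "pos": [toy_tag(t) for t in tokens]})


df = tag_text("Mr. Stevens looked out of the window.")

# With no path, to_csv returns the CSV as a string;
# passing a filename would write it to disk instead.
csv_text = df.to_csv(index=False)
```

Keeping one row per token makes it easy to add further columns later, such as a Named Entity tag per word.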

Named Entities can span more than one word, e.g. Mr. Stevens is a single Named Entity but has 2 POS tags. Along with working on the models, I have been setting up some CSS to display this content nicely. So far this is just static code, but integrating it into the working browser-based POS tagger should be straightforward enough. This is what it looks like so far. Each tile shows the words with their POS tags directly below, and the possibly multi-word Named Entity tags below that.
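To make the multi-word point concrete, here is a small sketch of how per-word entity tags could be merged into spans for display. It assumes a BIO-style scheme (`B-` starts an entity, `I-` continues it, `O` is outside any entity), which is a common convention but not necessarily what my model will end up using; the `group_entities` name and the `PER` label are also just for illustration.

```python
def group_entities(words, ne_tags):
    """Merge consecutive B-/I- tagged words into (text, label) spans."""
    spans = []
    current_words, current_label = [], None
    for word, tag in zip(words, ne_tags):
        prefix, _, label = tag.partition("-")
        # An I- tag with the same label extends the entity in progress.
        if prefix == "I" and current_label == label:
            current_words.append(word)
            continue
        # Anything else closes the entity in progress, if there is one.
        if current_words:
            spans.append((" ".join(current_words), current_label))
            current_words, current_label = [], None
        # B- starts a new entity; a stray I- is treated leniently as a start too.
        if prefix in ("B", "I"):
            current_words, current_label = [word], label
    if current_words:
        spans.append((" ".join(current_words), current_label))
    return spans


words = ["Mr.", "Stevens", "polished", "the", "silver"]
ne_tags = ["B-PER", "I-PER", "O", "O", "O"]
entities = group_entities(words, ne_tags)
```

Here the two words Mr. and Stevens come back as one span, which is the unit the Named Entity row of a tile would need to stretch across.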

The predictions are very slow at the moment. I am, however, running them one at a time. I think there are about 250k words in that book, and at the current rate it looks like tagging all of them will take more than an hour, which works out to more than roughly 14 ms per word. I may also have overcomplicated the model in an effort to get higher accuracy, which will slow things down too.
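One likely fix for the one-at-a-time problem is to send words to the model in batches, so the fixed cost of each call is paid far less often. The sketch below only counts calls to show the difference; `CountingModel` is a hypothetical stand-in for the tagger, and the batch size of 128 is an arbitrary choice, not something I have tuned.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


class CountingModel:
    """Hypothetical stand-in for the tagger; records how often it is called."""

    def __init__(self):
        self.calls = 0

    def predict(self, batch):
        self.calls += 1
        return ["X" for _ in batch]


def tag_all(model, words, batch_size):
    """Tag every word, calling the model once per batch."""
    tags = []
    for batch in batched(words, batch_size):
        tags.extend(model.predict(batch))
    return tags


words = ["word"] * 1000

one_at_a_time = CountingModel()
tag_all(one_at_a_time, words, batch_size=1)

in_batches = CountingModel()
tags = tag_all(in_batches, words, batch_size=128)
```

With 1000 words, the first run makes 1000 model calls and the second makes 8. How much wall-clock time that saves depends on the real model, so it still needs measuring.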

Investigating the performance tradeoffs here will be well worth some future effort.
