Posts from March, 2021

Working with imbalanced data

The Stroke dataset on Kaggle is a good lesson in the perils of ignoring an imbalance in your data when building a binary classifier. Last week I trained a Random Forest Classifier on this data to predict the likelihood of stroke given some data about the patient. The model was 95% accurate out of the box. Sounds good, right? It turns out that the data is highly imbalanced: there are far more 'no-stroke' patients than patients who had a stroke. A classifier that just predicts no-stroke for everyone will get very high accuracy, so this model is rubbish. A better measure will take into account the rates of false positives and false negatives; the stroke patients the model misses are the false negatives. To do this we will use ROC curves. These vary the discrimination threshold of the classifier across a range to determine the false positive and true positive rates at each threshold. Plotting this gives an indication of the model's performance in a m...
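
A minimal sketch of that evaluation, assuming a fitted scikit-learn classifier clf and a held-out test split X_test, y_test (the names here are hypothetical):

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Probability of the positive (stroke) class for each test patient
probs = clf.predict_proba(X_test)[:, 1]

# Sweep the decision threshold to get FPR and TPR at each point
fpr, tpr, thresholds = roc_curve(y_test, probs)
print("AUC:", roc_auc_score(y_test, probs))

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()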

Adding React Router to our app

As it stands our app will get very untidy as we add functions. We can separate components and import them into a main file as needed, but navigation for the user is still going to be messy or difficult to set up. React Router lets us use links to React components to tidy up the user experience. So I have separated out the (blank) home page from the New Book and list components. React Router is here.

Using curl scriptlets from a single file in VS Code

VS Code is a big step forward for code development. I found myself moving between consoles to run curl and the IDE to code in Python or JavaScript. It would be easier if I could run lines of curl from a single bigger file of scripts, the way a notebook works. It turns out you can do just that. Pull up the command list with Cmd+Shift+P and locate 'run selected text in active terminal'. Click on the settings cog on the right hand side and choose a shortcut key binding that does not clash with something else; VS Code will warn you. And that is it. Saves a lot of hassle. These scripts are much easier to read and edit than commands on the command line.

Creating our own word embeddings with GloVe

The Stanford GloVe website gives details on how to train your own set of embeddings. This is pretty straightforward using the tools they have provided. I am interested in seeing if I can get embeddings trained for some of the words in our book which are not in even the large-corpus GloVe models. Bleak House is a Dickens novel, so it has an old-fashioned vocabulary in many ways. Just tokenizing the original text of Bleak House gives 46 words that are not in the original 1.9MM-word vocabulary of GloVe. Extending this to cover a much larger range of Dickens novels - and maybe Victorian novels in general - would probably produce a much better result. The new words are a strange lot, but maybe that is even more of a reason to be able to provide synonyms: "'eart", "'my", "'now", "'ouse", "'prentices", "'t", "'this", "'what", "'you", ...
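
A minimal sketch of that vocabulary check, assuming a local copy of the novel in bleak_house.txt and a GloVe vectors file such as glove.42B.300d.txt (both file names are assumptions):

import re

# Words present in the pretrained GloVe file (first token on each line)
glove_vocab = set()
with open("glove.42B.300d.txt", encoding="utf-8") as f:
    for line in f:
        glove_vocab.add(line.split(" ", 1)[0])

# Crude tokenization of the novel, keeping internal apostrophes like 'ouse
with open("bleak_house.txt", encoding="utf-8") as f:
    tokens = set(re.findall(r"[a-z']+", f.read().lower()))

missing = sorted(tokens - glove_vocab)
print(len(missing), missing[:10])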

Synonyms using GloVe

If you get a set of pretrained word embeddings like GloVe from Stanford you can create a synonym generator very easily. First use Locality Sensitive Hashing to split the vector space of all words in your vocabulary into a number of partitions. This is as simple as determining the range of each of your embedding dimensions and picking a uniform random point between those ends. A set of these random points - one for each of your embedding dimensions - constitutes a hyperplane. Multiply the word vector of a word by this hyperplane vector elementwise and sum the results (a dot product). The sign of the result (-1, +1 or 0) gives you the key part for that hyperplane. Repeat for all of the other hyperplanes and you can form a tuple of these key parts. Key a dict with these tuples and a list of the associated words and you are good to go. When you want to find a synonym you find the bucket of words with the same hash key and calculate the Cosine Similarity for each of them. Sort by these similarities and you have your synony...
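
Here is a minimal numpy sketch of that scheme, assuming embeddings is a dict mapping each word to its numpy vector (a hypothetical name):

import numpy as np
from collections import defaultdict

def build_lsh_buckets(embeddings, n_planes=8, seed=0):
    """Bucket words by the signs of dot products with random hyperplanes."""
    rng = np.random.default_rng(seed)
    words = list(embeddings.keys())
    vectors = np.array([embeddings[w] for w in words])
    # One random point per dimension, drawn within that dimension's range,
    # gives us one hyperplane vector; repeat for n_planes hyperplanes.
    lo, hi = vectors.min(axis=0), vectors.max(axis=0)
    planes = rng.uniform(lo, hi, size=(n_planes, vectors.shape[1]))
    buckets = defaultdict(list)
    for word, vec in zip(words, vectors):
        key = tuple(np.sign(planes @ vec).astype(int))
        buckets[key].append(word)
    return buckets, planes

def synonyms(word, embeddings, buckets, planes, top_n=5):
    vec = embeddings[word]
    key = tuple(np.sign(planes @ vec).astype(int))
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    # Rank the other words in this bucket by cosine similarity
    candidates = [w for w in buckets.get(key, []) if w != word]
    return sorted(candidates, key=lambda w: cosine(vec, embeddings[w]), reverse=True)[:top_n]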

POS Prediction on the Server side

Previously I got the POS tagging model working in JavaScript having built the model in Python. This necessitates quite a bit of transcription of code from Python to JavaScript. Realistically, running inference in the browser is not going to be a common use case, at least in the near term. So for this reason I have changed the Flask app to take a paragraph of text, encode it, run inference and then decode the response. I am keeping this response - called words - alongside each original paragraph in the CouchDB database. For now I am not showing that on the front end, but that will come shortly.
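
A minimal sketch of such an endpoint, assuming hypothetical encode, model and decode helpers from the tagging code:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/tag", methods=["POST"])
def tag_paragraph():
    paragraph = request.json["content"]
    # Encode the text, run the POS model, then decode back to tagged words
    encoded = encode(paragraph)
    predictions = model.predict(encoded)
    words = decode(paragraph, predictions)
    return jsonify({"words": words})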

Using TensorFlow Serving

Having converted the front end to React, the next step is to get a model making predictions from the server. TensorFlow Serving seems like the best way to do this. Instructions are here for using Serving with Docker, which is the recommended approach. Some tips on Docker: after you have run docker pull to get the TensorFlow Serving container you will want to list your running containers:

docker ps

To run a container use docker run as per the instructions:

docker run -p 8501:8501 --mount type=bind,source=$(pwd)/data/model/pos_checkpoint,target=/models/pos_checkpoint -e MODEL_NAME=pos_checkpoint -t tensorflow/serving

Then use docker stop XX, where XX is enough of the container ID from docker ps to identify that container. You will need to put your SavedModel into an integer folder for the version. The command to run a prediction using the REST endpoint is:

curl -d '{"instances":[1,2,3]}' -X POST http://localhost:8501/v1/models/pos_checkpoint:predict

That will error because the ...
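
The same REST call from Python, as a small sketch (the instances payload here is a placeholder; the real request needs inputs matching the model's expected shape):

import requests

# TensorFlow Serving REST API: POST /v1/models/<name>:predict
url = "http://localhost:8501/v1/models/pos_checkpoint:predict"
payload = {"instances": [[1, 2, 3, 4, 5]]}  # placeholder encoding of one window

response = requests.post(url, json=payload)
response.raise_for_status()
print(response.json()["predictions"])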

Apache CouchDB to store text

So far the application will find Parts of Speech and Named Entities. For anything more than this we really need a way of working on a larger-scale piece of work like a novel. I have chosen Apache CouchDB for this purpose. It's a simple JSON-based database. There are handy bulk insert operations so that all 5419 paragraphs of the book can be uploaded in about a second:

docs = []
for id, content in enumerate(cleaned_paragraphs):
    docs.append({"id": id, "content": content})
db.bulk_docs(docs)

Next up is to apply our POS tagger to the text on a paragraph by paragraph basis, or globally for the whole novel. The fact that CouchDB is unstructured is a huge advantage here: we can update some paragraphs to have POS tags, for example, without touching the structure of the others. Our Python code can decide how to deal with the returned structures.
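
As a sketch of that schemaless update, going through CouchDB's HTTP API directly (the database URL and document id are assumptions), fetching one paragraph document, attaching a pos field and writing it back:

import requests

base = "http://localhost:5984/bleak_house"  # assumed database URL

# CouchDB needs the document's current _rev to accept an update,
# which comes back with the GET.
doc = requests.get(f"{base}/some_doc_id").json()
doc["pos"] = [["Jarndyce", "NNP"], ["and", "CC"], ["Jarndyce", "NNP"]]
requests.put(f"{base}/{doc['_id']}", json=doc)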

Text Summarisation with BERT

Hugging Face provides a wide range of pretrained models for lots of machine learning tasks. Summarising the chapters of our novel - Bleak House - would be an interesting challenge. This could also be a very useful mnemonic technique if you were reviewing a long piece of work. So here are the Hugging Face instructions on this:

pip install transformers

Then in Python:

from transformers import pipeline
summarizer = pipeline("summarization")

Give it the text to summarize along with some params:

summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False)

BERT is a large model and takes about 15 minutes to download for me. However it is then cached locally and so will be available next time you try to run it.
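
Chapters of a novel will exceed the pipeline's input limit, so one approach (a sketch, chunking naively on whitespace; the 400-word chunk size is a guess at a safe margin) is to summarize a chapter piecewise:

from transformers import pipeline

summarizer = pipeline("summarization")

def summarize_chapter(text, chunk_words=400):
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    # Summarize each chunk and join the pieces into one summary
    pieces = summarizer(chunks, max_length=130, min_length=30, do_sample=False)
    return " ".join(p["summary_text"] for p in pieces)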

Experimenting with pretrained embedding vectors

The GloVe word vectors from Stanford are a good place to start with using pretrained weights in an embedding layer. They consist of a large vocabulary - about 400k words - and a word vector for each one. They come in various vector lengths from 50 to 300. When you are adding an embedding layer in Keras you can specify the weights to use and set the layer as untrainable. So this is my model:

def get_pretrained_model():
    model = keras.Sequential()
    model.add(layers.Embedding(len(word_index) + 1, 200, input_length=max_len,
                               weights=[embeddings_matrix], trainable=False))
    model.add(layers.Bidirectional(layers.LSTM(32)))
    model.add(layers.Dense(6, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    return model

Despite training this on a Colab GPU it still only gets to just above 80% accuracy. I am using the IMDB sentiment dataset and may be coming up against the limits of the size of that dataset.
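
For completeness, a sketch of how the embeddings_matrix above can be built from a GloVe file, assuming a Keras-style word_index dict and the 200d vectors file (the file name is an assumption):

import numpy as np

embedding_dim = 200
embeddings_matrix = np.zeros((len(word_index) + 1, embedding_dim))

# Copy the pretrained vector into the row for each word we know;
# words missing from GloVe keep the all-zeros row.
with open("glove.6B.200d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        word, vec = parts[0], np.asarray(parts[1:], dtype="float32")
        if word in word_index:
            embeddings_matrix[word_index[word]] = vec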

Using TensorFlow Datasets makes sequence modelling much easier

I have switched my sentiment model to use the very nice sequence preprocessing features of TFDS and Keras. So instead of writing my own vocab, word_to_index dictionary, and my own encoding and padding of the text, I can just use the canned ones:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

As always the code is in the git repo.
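
A minimal sketch of those utilities in use, with a hypothetical train_texts list and a few illustrative parameter choices:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")
tokenizer.fit_on_texts(train_texts)  # builds the word_to_index vocab

sequences = tokenizer.texts_to_sequences(train_texts)
padded = pad_sequences(sequences, maxlen=1000, padding="post", truncating="post")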

Sentiment Analysis based on IMDB Movie Review dataset

So the next step in analysing novels is to create a sentiment analyser. I got the IMDB movie review dataset from Kaggle here. There are a number of challenges. The sentiment analysis I have done before was based on Twitter data, which has short inputs; some of the movie reviews are over 1000 words long. A histogram of the lengths of the reviews suggests a cutoff of 1000 words as a sensible first pass. That still constitutes a lot of uninformative padding, but I can refine the method later. My first model gives a categorical accuracy of about 84%. Not great, but could be worse. I ran a bunch of predictions on movie reviews, some that I made up and others from the test set. The model was 'correct' most of the time, but was very close to the middle of the distribution every time. It does not inspire much confidence. So before I try to get this working on a long piece of work like a novel I need to improve the mod...
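
A quick sketch of that length check, assuming the reviews are in a pandas DataFrame column df['review'] (the names are hypothetical):

import pandas as pd
import matplotlib.pyplot as plt

lengths = df["review"].str.split().str.len()
lengths.hist(bins=50)
plt.xlabel("Words per review")
plt.ylabel("Number of reviews")
plt.show()

# Fraction of reviews fully covered by a 1000-word cutoff
print((lengths <= 1000).mean())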

Extracted some common functions

I have tidied up the code for both the Parts of Speech (POS) tagger and the Named Entity Recognition (NER) system. This involved extracting a couple of functions into a module for reuse and removing any of the investigation code used to work out what the model was doing.  Now running both of these notebooks in sequence delivers 2 models which can be converted to tfjs and run in the client. So far the tfjs converter just handles the POS model. Will get the NER model running next. 

Named Entity Recognition for Bleak House

The Named Entity Recognition is now working on the Python side. So to recap:

- I trained a POS tagger to 90+% accuracy
- Used this as part of the feature set for a Named Entity Recognition model
- Tested this out on a novel - Bleak House

The results are the most common named entities in the book, and they match well with my knowledge from reading it. Next up I want to get this working on the front end. I will tidy up the code for the POS and NER models, write the JavaScript version of the NER prediction code and get the novel-loading code working. It would be interesting to see where these named entities feature throughout the book; a timeline of some sort would be good there.
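
As a small sketch of that tally, assuming a list of extracted entity strings called entities (a hypothetical name):

from collections import Counter

entity_counts = Counter(entities)
for name, count in entity_counts.most_common(10):
    print(f"{name}: {count}")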

POS tagger working for large body of text

I have fixed the POS tagger so that it can take a large body of text - a book in this case - and tag all of the words. I output the tagged corpus as a pandas DataFrame to CSV. I will use this in the next stage, which is to predict Named Entities from that corpus. Named Entities can span more than one word: e.g. Mr. Stevens is a single Named Entity, but has 2 tags. Along with working on the models I have been setting up some CSS to display this content nicely. So far this is just static code, but integrating this into the working browser-based POS tagger should be straightforward enough. This is what it looks like so far: words with their POS tags directly below, and the possibly multi-word Named Entity tags below that on the tile. The predictions seem to be very slow at the moment; I am, however, running them singly. I think there are about 250k words in that book, so it looks like taking more than an hour at this rate. I may also have overcomplicated the model in an effort to get hi...
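
A sketch of that export step, assuming parallel lists of tokens and predicted tags (hypothetical names):

import pandas as pd

# One row per token, with its predicted part-of-speech tag
tagged = pd.DataFrame({"word": tokens, "pos": predicted_tags})
tagged.to_csv("bleak_house_pos.csv", index=False)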

Modifying the POS tagger to predict on a large body of text - python

So the output from the tensorflow.js model works well, and the POS tags that are being predicted match those in the training text. Next I am trying to run some predictions on a larger body of text. Extracting the Named Entities from a novel is one of the goals of this project. I have tested the prediction code on the Python side, but just using the training data as input. This is fine for rough testing, but I need the Python code to be able to accept a large string of text and apply POS tags to every word. I downloaded Bleak House from Project Gutenberg, stripped out some of the pre and post boilerplate and tried to run some predictions on that. I realised at this point that I had not made any allowance for unknown words. When you build your vocab from the training data and test with that too, this does not come up. I used defaultdicts to get this working. The tags being predicted look wrong at this stage, but I will debug next. Once the POS tagger works on the p...
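
A minimal sketch of that unknown-word handling, assuming a reserved index for out-of-vocabulary words and hypothetical training_vocab and novel_tokens names:

from collections import defaultdict

UNK_INDEX = 0  # assumed reserved index for unknown words

# Any word not seen in training maps to UNK_INDEX instead of raising KeyError
word_to_index = defaultdict(lambda: UNK_INDEX,
                            {word: i + 1 for i, word in enumerate(training_vocab)})

encoded = [word_to_index[w] for w in novel_tokens]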

A Working POS tagger in tensorflow.js

Quite a tricky thing to debug, but the full POS tagger is now working. You can find the code here. Encoding and decoding the data on the client was the main challenge. The JavaScript and layout are still very basic, but sufficient for now. I have also improved the model to be about 96% accurate, up from the previous 92%. This involved adding a dense layer and dropout, and training on more data. The WSJ dataset is very large, so I am still only using a fraction of it: 50k out of 850k. I also removed a bunch of non-POS tags, things like punctuation; they may be useful for training some models, but mine just relies on words. Next up is to train a Named Entity Recognition model using these Parts of Speech as part of the input features.

Recreating python feature extraction code in JavaScript

So TensorFlow's new Preprocessing Layers will make the use of models from Python in TensorFlow.js much easier. At the moment those layers are only available in Python, not JavaScript, so there is some transcription to be done. The original model was trained by creating a windowed dataset with a sequence length of 5. In order to get prediction working in the browser that code needs to be replicated:

function createWindowedDataset(data) {
  let windowed = [];
  for (let i = 0; i < data.length - contextSize; i++) {
    windowed[i] = data.slice(i, i + contextSize);
  }
  return windowed;
}

Now my model loading function has changed to include a loop which loads a set of JSON files. There are 4 of these required to get the model predicting and to make sense of the predictions. So that code looks like this:

const jsonToLoad = ['word_to_index.json', 'pos_to_index.json', 'most_common_tag_for_word.json', 'index_to_...