Posts

Execute Jupyter notebooks line by line in VS Code

In the browser, normal Jupyter notebook execution is cell by cell. This frequently results in cells needing to be split and rejoined at various points in the dev process. VS Code allows line-by-line execution of the code within those cells. This feels a lot like running SQL code in MS SQL Query Analyzer and can be a very natural way to work. In addition, VS Code gives you a nice view of the data returned from a cell or line execution with its Data Viewer. In WSL Ubuntu, to get the line-by-line stuff working you will need to install the Jupyter extension and the Python extension. Once these are in place, getting VS Code to find your correct python environment is very straightforward: with the Python extension installed, VS Code will find your virtual environments and your pyenv environments. You can find more about line-by-line debugging here: https://github.com/microsoft/vscode-jupyter/wiki/Setting-Up-Run-by-Line-and-Debugging-for-Notebooks
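
A toy cell to try it on (names here are just placeholders; Run by Line is bound to F10 by default once the extensions are installed):

    import pandas as pd

    # Run by Line (F10) steps through these statements one at a time
    df = pd.DataFrame({"patient": [1, 2, 3], "age": [40, 55, 62]})
    df["decade"] = df["age"] // 10

    # after this runs, open df in the Data Viewer from the Jupyter variables pane
    df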

Pyenv to manage multiple versions of python

Similar to nvm for managing node versions, pyenv allows you to install and move between different versions of python. Full instructions are here, but a simple introduction (for Ubuntu on WSL) is given here.

Install the dependencies:

    sudo apt-get install -y make build-essential libssl-dev zlib1g-dev \
    libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev \
    libncursesw5-dev xz-utils tk-dev libffi-dev liblzma-dev python-openssl

Install pyenv and add it to your path according to the system-specific instructions given in the console:

    curl https://pyenv.run | bash

To list the versions of python that are installed use:

    pyenv versions

The * in the listing shows which one is currently active. To install another use:

    pyenv install 3.8.13

To make a specific python version active in the current directory use:

    pyenv local 3.8.13

For the global setting use:

    pyenv global 3.8.13

Also make sure you don't have any shell aliases messing up the names for python.
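
A quick sanity check from inside python itself, which is useful when aliases are suspect (the ~/.pyenv path assumes a default pyenv install):

    import sys

    print(sys.version)     # version string of the interpreter actually running
    print(sys.executable)  # should point into ~/.pyenv if the shim resolved correctly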

Working with imbalanced data

The Stroke dataset on Kaggle is a good lesson in the perils of ignoring an imbalance in your data for a binary classifier. Last week I trained a Random Forest Classifier on this data to predict the likelihood of stroke given some data about the patient. The model was 95% accurate out of the box. Sounds good, right? It turns out that the data is highly imbalanced: there are a lot more 'no-stroke' patients than those that had a stroke. A classifier that just predicts no-stroke for everyone will get very high accuracy, so this model is rubbish. A better measure will take into account the rate of false positives and the rate of false negatives. The stroke patients the model misses are the false negatives. To do this we will use ROC curves. These vary the discrimination threshold of the classifier across a range to determine the true positive and false positive rates at each of these thresholds. Plotting this gives an indication of the model's performance in a m...
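
Not the original notebook, but a minimal sketch of the effect with scikit-learn on synthetic data shaped like the stroke problem (all names and numbers here are illustrative):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score, roc_curve
    from sklearn.model_selection import train_test_split

    # synthetic stand-in for the stroke data: roughly 95% negative class
    X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=0)

    clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # accuracy is flattering because it is dominated by the majority class
    print("accuracy:", clf.score(X_test, y_test))

    # the ROC curve sweeps the decision threshold over the predicted
    # probabilities and records the trade-off at each point
    probs = clf.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, probs)
    print("AUC:", roc_auc_score(y_test, probs))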

Adding React Router to our app

As it stands, our app will get very untidy as we add functions. We can separate components and import them into a main file as needed, but navigation for the user is still going to be messy, or difficult to set up. React Router lets us use links to React components to tidy up the user experience. So I have separated out the (blank) home page from the New Book and list components. React Router is here.

Using curl scriptlets from a single file in VS Code

VSCode is a big step forward for code development. I found myself moving between consoles to run curl and the IDE to code in python or javascript. It would be easier if I could run lines of curl from a single bigger file of scripts - the way a notebook works. Turns out you can do just that. Pull up the command palette with Cmd+Shift+P. Locate 'run selected text in active terminal'. Click on the settings cog on the right hand side. Choose a shortcut key binding that does not clash with something else - VSCode will warn you. And that is it. Saves a lot of hassle. Much easier to read and edit these than using the command line.
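
For reference, the same binding can also be added by hand in keybindings.json (the key chosen here is just an example):

    { "key": "ctrl+shift+enter", "command": "workbench.action.terminal.runSelectedText" }

Select a few curl lines in your scripts file, hit the binding, and they run in the active terminal.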

Creating our own word embeddings with Glove

The Stanford Glove website gives details on how to train your own set of embeddings. This is pretty straightforward using the tools they have provided. I am interested in seeing if I can get embeddings trained for some of the words in our book which are not even in the large-corpus Glove models. Bleak House is a Dickens novel, so it has an old-fashioned vocabulary in many ways. Just tokenizing the original text of Bleak House and training on it gives embeddings for 46 words that were not captured in the original 1.9MM-word vocab of Glove. Extending this to cover a much larger range of Dickens novels - and maybe Victorian novels in general - would probably produce a much better result. The new words are a strange lot, but maybe that is even more of a reason to be able to provide synonyms: "'eart", "'my", "'now", "'ouse", "'prentices", "'t", "'this", "'what", "'you", ...
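
A rough sketch of that vocabulary comparison (file paths are placeholders; the 1.9MM-word vocab corresponds to the uncased glove.42B.300d download, hence the lowercasing):

    import re

    # the first token on each line of a Glove vectors file is the word itself
    with open("glove.42B.300d.txt", encoding="utf-8") as f:
        glove_vocab = {line.split(" ", 1)[0] for line in f}

    with open("bleak_house.txt", encoding="utf-8") as f:
        tokens = set(re.findall(r"[\w']+", f.read().lower()))

    new_words = tokens - glove_vocab
    print(len(new_words), sorted(new_words)[:10])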

Synonyms using Glove

If you get a set of pretrained word embeddings like Glove from Stanford you can create a synonym generator very easily. First use Locality Sensitive Hashing to split up the vector space of all words in your vocabulary into a number of partitions. This is as simple as determining the range of each of your embedding dimensions and picking a uniformly random point between these ends. A set of these random points - one for each of your embedding dimensions - constitutes a hyperplane. You can multiply the word vector of a word by this hyperplane and sum the results (a dot product). The sign of the result (-1, +1 or 0) gives you the key part for that hyperplane. Repeat for all of the other hyperplanes and you can form a tuple of these key parts. Key a dict with these tuples and a list of the associated words and you are good to go. So when you want to find a synonym, find the bucket of words with the same hash key and calculate the Cosine Similarity for each of them. Sort by these similarities and you have your synony...
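
A compact sketch of the scheme (this draws the hyperplane normals from a standard normal rather than the range-based sampling described above, and embeddings is assumed to be a word -> numpy vector dict loaded elsewhere from the Glove file):

    import numpy as np
    from collections import defaultdict

    rng = np.random.default_rng(0)

    def build_index(embeddings, n_planes=10):
        # bucket every word by the signs of its dot products with the hyperplanes
        dims = len(next(iter(embeddings.values())))
        planes = rng.standard_normal((n_planes, dims))  # one normal vector per plane
        buckets = defaultdict(list)
        for word, vec in embeddings.items():
            key = tuple(np.sign(planes @ vec).astype(int))
            buckets[key].append(word)
        return planes, buckets

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def synonyms(word, embeddings, planes, buckets, top=5):
        vec = embeddings[word]
        key = tuple(np.sign(planes @ vec).astype(int))
        candidates = [w for w in buckets[key] if w != word]
        # rank the words sharing the bucket by cosine similarity to the query
        return sorted(candidates, key=lambda w: cosine(vec, embeddings[w]),
                      reverse=True)[:top]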