Working with imbalanced data
The Stroke dataset on Kaggle is a good lesson in the perils of ignoring class imbalance when building a binary classifier. Last week I trained a Random Forest Classifier on this data to predict the likelihood of stroke from some basic patient attributes. Out of the box the model was 95% accurate. Sounds good, right? It turns out the data is highly imbalanced: there are far more 'no-stroke' patients than stroke patients. A classifier that simply predicts no-stroke for everyone will therefore score a very high accuracy, so this model is rubbish.
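Here is a minimal sketch of that trap, assuming the Kaggle CSV is saved locally as healthcare-dataset-stroke-data.csv with a binary stroke target column (the filename and the crude preprocessing are my assumptions, not something spelled out above):

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("healthcare-dataset-stroke-data.csv")
# Crude preprocessing: one-hot encode categoricals, zero-fill missing values.
X = pd.get_dummies(df.drop(columns=["stroke"])).fillna(0)
y = df["stroke"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# A "model" that always predicts the majority class (no stroke)...
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# ...scores almost as well on raw accuracy as a real Random Forest.
forest = RandomForestClassifier(random_state=42).fit(X_train, y_train)

print("Baseline accuracy:", baseline.score(X_test, y_test))
print("Forest accuracy:  ", forest.score(X_test, y_test))
```

On data this skewed, the two accuracy numbers come out close, which is exactly why accuracy alone is misleading here.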
A better measure takes into account the false positive rate and the false negative rate. The stroke patients the model misses are the false negatives. To capture this we will use ROC curves, which sweep the classifier's discrimination threshold across a range and record the true positive rate and false positive rate at each threshold (the false negative rate is just one minus the true positive rate). Plotting this gives a more realistic picture of the model's performance than accuracy alone.
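A sketch of plotting that curve, continuing from the snippet above; scikit-learn's roc_curve returns the false positive and true positive rates at each threshold:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Predicted probability of the positive (stroke) class for each test patient.
probs = forest.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, probs)

plt.plot(fpr, tpr, label="Random Forest")
plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```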
Details of ROC can be found here.
In addition to the curve there is a score: the area under the curve (AUC). Closer to 1 is better; 0.5 is no better than chance. My current naive model gives a ROC AUC of about 0.75.
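Continuing from the same snippets, computing the score is one call:

```python
from sklearn.metrics import roc_auc_score

# Area under the ROC curve for the held-out test set.
print("ROC AUC:", roc_auc_score(y_test, probs))
```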
Next we will try some techniques to deal with the imbalance and hopefully improve the ROC AUC. One candidate is sketched below.
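As a hedged preview (this is my sketch of one common option, not necessarily what the follow-up will use), scikit-learn's Random Forest can re-weight classes so the rare stroke cases count for more during training:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# class_weight="balanced" weights each class inversely to its frequency,
# so misclassifying the rare stroke cases costs more during training.
weighted = RandomForestClassifier(
    class_weight="balanced", random_state=42
).fit(X_train, y_train)

print("Weighted ROC AUC:",
      roc_auc_score(y_test, weighted.predict_proba(X_test)[:, 1]))
```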