Handling Imbalance in Datasets

This blog gives an overview of the problems posed by imbalanced training datasets and the various approaches for handling them.

So let’s start with what the dataset imbalance problem is.

Imbalance is one of the most common problems in business data; almost every business problem involves some form of data disparity. Data imbalance refers to a situation in which the number of observations in one target class is far greater than the number in another class (a binary problem) or other classes (a multi-class problem).

As an example, consider the classification of pixels (possibly cancerous) in mammogram images (Woods, Doss, Bowyer, Solka, Priebe, & Kegelmeyer, 1993). A typical mammography dataset might contain 98% normal pixels and 2% abnormal pixels.

A simple default strategy of always guessing the majority class would give a predictive accuracy of 98%. But would you consider that a good strategy? If we predict that every observation is non-cancerous, we get an impressive 98% accuracy while committing a life-threatening blunder.

The above image shows data imbalance in a binary classification problem. Most of the time, it is these minority classes that we need to detect; such data points are sometimes also called anomalies. Here is a t-SNE plot of imbalanced data.

Let’s first understand where we need to handle the imbalance in data. 

What is Anomaly Detection?

Anomaly detection aims to differentiate between “normal” and “anomalous” observations. We can use supervised, unsupervised, or semi-supervised algorithms for detection. There are several types of anomalies, such as point anomalies, contextual anomalies, and collective (group) anomalies.

point anomaly [Source]
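As a quick illustration of the unsupervised route, here is a minimal sketch using scikit-learn’s IsolationForest on synthetic data; the 2% contamination figure mirrors the mammography example above, and all shapes and values are illustrative assumptions.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(980, 2))    # "normal" points
X_anomalous = rng.uniform(low=4.0, high=6.0, size=(20, 2))  # point anomalies
X = np.vstack([X_normal, X_anomalous])

# contamination is the expected fraction of anomalies (2% here)
iso = IsolationForest(contamination=0.02, random_state=42).fit(X)
labels = iso.predict(X)  # +1 = normal, -1 = anomaly
print("flagged as anomalies:", (labels == -1).sum())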

What is Classification?

Classification is a type of supervised learning used to assign observations to two or more classes; its target is always categorical, so only supervised algorithms apply. Data imbalance is handled only in classification problems.

Example of binary classification [Source]

How to handle Imbalanced Data?

There are many techniques that can be used to handle this disproportion in data. Here is a rough outline of the useful approaches. These are listed approximately in order of effort:

Over and Under Sampling  [Source]
  1. Do nothing. Sometimes you get lucky and nothing needs to be done. You can train on the so-called natural (or stratified) distribution, and sometimes the model works without any modification.
  2. Balance the training set in some way (a resampling sketch follows the figure below):
    • Oversample the minority class: as shown in the figure above, minority-class samples are duplicated until their count equals that of the majority class.
    • Undersample the majority class: samples are removed from the majority class until the class distribution is balanced. This reduces the size of the training dataset.
    • Synthesize new minority samples: SMOTE (Synthetic Minority Over-sampling Technique) is a synthetic data generation technique. It artificially creates new minority-class samples until all classes have the same number of rows. Here is a link to read more about SMOTE.
    • Remove confusing data points: Tomek links. This removes borderline or ambiguous data points and hence helps form a clearer decision boundary between the classes.
Tomek Links [Source]
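To make the four balancing options above concrete, here is a minimal sketch using the imbalanced-learn library (a separate install from scikit-learn: pip install imbalanced-learn); the synthetic dataset and its 98/2 split are illustrative assumptions, mirroring the mammography example.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler, TomekLinks

# Synthetic 98% / 2% dataset
X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=0)
print("original:", Counter(y))

# 1. Oversample the minority class by duplicating its samples
X_os, y_os = RandomOverSampler(random_state=0).fit_resample(X, y)
print("oversampled:", Counter(y_os))

# 2. Undersample the majority class by dropping samples
X_us, y_us = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("undersampled:", Counter(y_us))

# 3. Synthesize new minority samples with SMOTE
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)
print("SMOTE:", Counter(y_sm))

# 4. Remove borderline points that form Tomek links
X_tl, y_tl = TomekLinks().fit_resample(X, y)
print("Tomek links:", Counter(y_tl))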

  3. At the algorithm level, or after it:
    • Adjust the decision threshold: many classification algorithms, such as logistic regression, predict the probability of belonging to each class. By default a threshold of 0.5 is used, but we can tune this threshold to find the value that misclassifies the minority class the least. The optimal threshold can be determined using the AUC-ROC curve (a threshold-tuning sketch appears at the end of this section).
    • Modify an existing algorithm to be more sensitive to rare classes: boosting algorithms are built on this principle. XGBoost, LightGBM, AdaBoost, etc. use a series of weak classifiers, giving more weight to misclassified observations in each iteration.
    • Adjust the class weights (misclassification costs): give a higher weight to the minority class in the loss function.
>>> from sklearn.linear_model import LogisticRegression
>>> clf = LogisticRegression(class_weight={0: 1, 1: 20})

In logistic regression, the loss function is binary cross-entropy:

Loss = −y·log(p) − (1−y)·log(1−p)

Now, with the class weights above ({0: 1, 1: 20}), the modified loss becomes:

NewLoss = −20·y·log(p) − 1·(1−y)·log(1−p)

So what exactly happens here?

  • If the model gives a probability of 0.3 and we misclassify a positive example, NewLoss takes the value −20·log(0.3) ≈ 10.46.
  • If the model gives a probability of 0.7 and we misclassify a negative example, NewLoss takes the value −log(1 − 0.7) = −log(0.3) ≈ 0.52 (base-10 logarithms throughout).

That means we penalize our model around twenty times more when it misclassifies a positive minority example than when it misclassifies a negative majority one.
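Here is a minimal sketch of the threshold-tuning idea referenced earlier in this section, scoring a grid of candidate thresholds by minority-class F1; the synthetic dataset and the F1 criterion (instead of the AUC-ROC route mentioned above) are illustrative choices.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic 98% / 2% dataset, as in the earlier sketches
X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(class_weight={0: 1, 1: 20}, max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # probability of the minority class

# Sweep candidate thresholds instead of blindly using 0.5
thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1_score(y_te, (proba >= t).astype(int)) for t in thresholds]
best = thresholds[int(np.argmax(scores))]
print(f"best threshold: {best:.2f}, minority-class F1: {max(scores):.3f}")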

Metrics for Imbalanced Data

The performance of machine learning algorithms is typically evaluated using predictive accuracy and the confusion matrix (for classification problems). However, accuracy is not appropriate when the data is imbalanced and/or the costs of different errors vary markedly.

confusion matrix

That’s why we use a confusion matrix: it shows exactly how the minority class is being misclassified.
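For instance, here is a minimal sketch of reading per-class errors with scikit-learn’s confusion matrix and classification report, again on an assumed synthetic 98/2 dataset:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

y_pred = clf.predict(X_te)
# Rows are true classes, columns are predicted classes, so the
# bottom-left cell counts the minority examples the model missed
print(confusion_matrix(y_te, y_pred))
# Per-class precision, recall, and F1 expose what accuracy hides
print(classification_report(y_te, y_pred))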

Conclusion

The points discussed in this blog apply to all imbalanced datasets, and most of them are good practices to follow in general:

  1. Don’t use accuracy (or error rate) to evaluate your classifier! There are two significant problems with it. First, accuracy applies a naive 0.50 threshold to decide between classes, which is usually wrong when the classes are imbalanced. Second, classification accuracy is based on a simple count of errors, which hides where those errors occur; look at other metrics before reaching a conclusion.
  2. Visualize the classifier performance using a ROC curve, a precision-recall curve, a lift curve, or a profit (gain) curve.
  3. Don’t get hard classifications (labels) from your classifier. Instead, get probability estimates via predict_proba or decision_function (in the scikit-learn library).
  4. When you get probability estimates, don’t blindly use a 0.50 decision threshold to separate classes. Look at performance curves and decide for yourself what threshold to use. Many errors were made in early papers because researchers naively used 0.5 as a cut-off.
  5. No matter what you do for training, always test on the natural (stratified) distribution your classifier is going to operate upon. See sklearn.model_selection.StratifiedKFold.
  6. You can get by without probability estimates, but if you need them, use calibration (see sklearn.calibration.CalibratedClassifierCV).
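As a closing illustration of points 5 and 6, here is a minimal sketch that combines StratifiedKFold evaluation with CalibratedClassifierCV; the sigmoid method and the synthetic data are illustrative assumptions.

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=0)

# Wrap the base model so its predicted probabilities are calibrated
base = LogisticRegression(max_iter=1000)
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=3)

# Stratified folds keep the natural 98/2 class ratio in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(calibrated, X, y, cv=cv, scoring="roc_auc")
print("ROC AUC per fold:", scores.round(3))

Sigmoid calibration is chosen here because isotonic calibration tends to overfit when the minority class contributes only a handful of examples per fold.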