SMOTE: Synthetic Minority Oversampling Technique

SMOTE is a technique used to address class imbalance in datasets. It increases the number of minority-class samples by generating synthetic examples, which helps in training a more robust classifier.

Imbalanced Data

Real-world datasets are often predominantly composed of “normal” examples, with only a small percentage of “abnormal” or “interesting” examples. Moreover, the cost of misclassifying an abnormal (interesting) example as a normal one is usually much higher than the reverse. Practical examples of imbalanced datasets appear in domains such as fraudulent telephone calls, telecommunications management, text classification, and detection of oil spills in satellite images.

[Figure: minority class in red, majority class in green]

Before SMOTE 

Many techniques were used prior to SMOTE; some of the well-known ones are described below:

  1. Resampling: Under-sampling of the majority (normal) class and over-sampling of the minority class were proposed as a good means of increasing the sensitivity of a classifier towards the minority class. The classifier was then trained on this resampled data.
  2. More weight to the minority class: Most machine learning models provide a parameter called class_weight. For example, in scikit-learn's Logistic Regression we can assign a higher weight to the minority class using a dictionary:
from sklearn.linear_model import LogisticRegression
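# weight class 1 (the minority class) 20x more than class 0 (the majority class)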
clf = LogisticRegression(class_weight={0:1,1:20})

In logistic regression, the loss function is binary cross-entropy:

Loss = −y·log(p) − (1−y)·log(1−p)

Now, the modified loss becomes:

NewLoss = −20·y·log(p) − 1·(1−y)·log(1−p)

So what exactly happens here?

  • If the model predicts a probability of 0.3 for a positive example (y = 1), we misclassify it, and NewLoss = −20·log(0.3) ≈ 10.46 (using base-10 logarithms).
  • If the model predicts a probability of 0.7 for a negative example (y = 0), we misclassify it, and NewLoss = −log(1 − 0.7) = −log(0.3) ≈ 0.52.

That means we penalize our model twenty times more when it misclassifies a positive (minority) example in this case.
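As a quick sanity check, here is a minimal sketch in plain Python (using base-10 logarithms, to match the numbers above) of the weighted loss for these two cases:

from math import log10

def weighted_bce(y, p, w_pos=20, w_neg=1):
    # weighted binary cross-entropy for a single example (base-10 logs)
    return -w_pos * y * log10(p) - w_neg * (1 - y) * log10(1 - p)

print(weighted_bce(y=1, p=0.3))  # misclassified positive example -> ~10.46
print(weighted_bce(y=0, p=0.7))  # misclassified negative example -> ~0.52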

What is SMOTE?

In 2002, four researchers, Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer, published their research paper on the Synthetic Minority Oversampling Technique. In SMOTE, the minority class is over-sampled by taking each minority class sample and introducing synthetic examples along the line segments joining any or all of its k minority-class nearest neighbors.


Its implementation currently uses five nearest neighbors. For instance, if the amount of over-sampling needed is 200%, only two of the five nearest neighbors are chosen and one sample is generated in the direction of each. Synthetic samples are generated in the following way: take the difference between the feature vector (sample) under consideration and its nearest neighbor, multiply this difference by a random number between 0 and 1, and add it to the feature vector under consideration. This selects a random point along the line segment between the two samples and effectively forces the decision region of the minority class to become more general.
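To make the interpolation step concrete, here is a minimal NumPy sketch of generating one synthetic sample. This is not the authors' original code; the function name and defaults are illustrative assumptions.

import numpy as np

def smote_sample(X_minority, i, k=5, rng=None):
    # generate one synthetic point from minority sample i (illustrative sketch)
    if rng is None:
        rng = np.random.default_rng()
    x = X_minority[i]
    # distances from sample i to every minority sample
    dists = np.linalg.norm(X_minority - x, axis=1)
    # indices of the k nearest neighbors, excluding the sample itself
    neighbors = np.argsort(dists)[1:k + 1]
    # pick one neighbor and interpolate at a random point along the joining segment
    nn = X_minority[rng.choice(neighbors)]
    gap = rng.random()            # random number between 0 and 1
    return x + gap * (nn - x)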

Why SMOTE?

SMOTE has proved to be better than other methods that handle imbalanced datasets because:

  • It significantly improves the recall of the minority class.
  • It classifies minority-class examples better than the other methods.
  • Decision trees built after SMOTE can achieve better results with fewer nodes, which means a more general boundary between classes.
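In practice, SMOTE is usually applied through the imbalanced-learn library rather than implemented by hand. Below is a minimal sketch, assuming imbalanced-learn is installed, on an illustrative toy dataset:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# toy imbalanced dataset with roughly 10% minority class
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))        # majority class ~900, minority class ~100

# over-sample the minority class using 5 nearest neighbors (the default)
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print(Counter(y_res))    # the two classes are now balanced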

When not to use SMOTE?

SMOTE is usually not preferred for multi-label classification problems unless the decision boundary is clear. It synthetically adds points along the lines joining a sample to its k nearest neighbours; if data points of different classes are close to each other, these line segments can cross into the regions of other classes, which may produce noisy data. In that case, the synthetically generated samples might effectively belong to a different class.

Conclusion

  • SMOTE is the preferred technique for binary classification on imbalanced data.
  • Ideally, you should also try to collect more data for such business problems.
  • After applying SMOTE, you could also consider anomaly detection, which finds rare items, events, or observations that raise suspicion by differing significantly from the majority of the data. This helps remove such noise further.