SMOTE: Synthetic Minority Oversampling Technique

SMOTE is a technique used to address imbalanced-dataset problems. It increases the number of minority-class samples in imbalanced data so that a more robust classifier can be trained.

Imbalance Data

Real-world datasets are often predominantly composed of “normal” examples with only a small percentage of “abnormal” or “interesting” examples. It is also often the case that the cost of misclassifying an abnormal (interesting) example as a normal one is much higher than the cost of the reverse error. Practical examples of imbalanced datasets appear in domains such as fraudulent telephone calls, telecommunications management, text classification, and detection of oil spills in satellite images.

[Figure: minority class in red, majority class in green]

Before SMOTE 

Several techniques were used prior to SMOTE; some of the best-known ones are listed here:

  1. Resampling: Under-sampling of the majority (normal) class and over-sampling of the minority class have been proposed as a good means of increasing the sensitivity of a classifier towards the minority class. The classifier is then trained on the resampled data (a minimal resampling sketch appears at the end of this section).
  2. More weight to the minority class: Most machine learning models provide a parameter such as class_weight. For example, in Logistic Regression we can specify a higher weight for the minority class using a dictionary.
from sklearn.linear_model import LogisticRegression
# weight errors on class 1 (the minority) 20 times more heavily than errors on class 0
clf = LogisticRegression(class_weight={0: 1, 1: 20})

In logistic regression, the loss function is binary cross-entropy:

Loss = −ylog(p) − (1−y)log(1−p)

Now, the modified loss becomes:

NewLoss = −20*ylog(p) − 1*(1−y)log(1−p)

So what happens exactly here?

  • If the model gives a probability of 0.3 for a positive example (so we misclassify it), the NewLoss is −20log(0.3) ≈ 10.45 (using base-10 logarithms).
  • If the model gives a probability of 0.7 for a negative example (so we misclassify it), the NewLoss is −log(1 − 0.7) = −log(0.3) ≈ 0.52.

That means we penalize our model around twenty times more when it misclassifies a positive minority example in this case.
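As a rough illustration of the resampling idea in point 1 above, here is a minimal under-sampling sketch. It assumes NumPy arrays X and y with a binary label where 0 is the majority class; the function name and the ratio parameter are illustrative choices, not part of any particular library.

import numpy as np

rng = np.random.default_rng(42)

def undersample_majority(X, y, majority_label=0, ratio=1.0):
    # keep every minority sample and only a random subset of majority samples
    minority_idx = np.where(y != majority_label)[0]
    majority_idx = np.where(y == majority_label)[0]
    # ratio=1.0 keeps as many majority samples as there are minority samples
    n_keep = min(len(majority_idx), int(len(minority_idx) * ratio))
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)
    keep = np.concatenate([minority_idx, kept_majority])
    return X[keep], y[keep]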

What is SMOTE?

Four researchers, Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer, published their research paper on the Synthetic Minority Oversampling Technique in 2002. In SMOTE, the minority class is over-sampled by taking each minority class sample and introducing synthetic examples along the line segments joining it to any or all of its k minority class nearest neighbors.


Its implementation currently uses five nearest neighbors. For instance, if the amount of over-sampling needed is 200%, only two of the five nearest neighbors are chosen and one sample is generated in the direction of each. Synthetic samples are generated in the following way: take the difference between the feature vector (sample) under consideration and its nearest neighbor, multiply this difference by a random number between 0 and 1, and add it to the feature vector under consideration. This selects a random point along the line segment between the two samples. This approach effectively forces the decision region of the minority class to become more general.
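To make the interpolation step concrete, here is a rough from-scratch sketch (not the reference implementation). It assumes a NumPy array X_minority holding only minority-class samples, with more than k of them, and uses scikit-learn's NearestNeighbors; the function name and parameters are illustrative.

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

def smote_samples(X_minority, n_synthetic, k=5):
    # find the k nearest minority neighbors of each minority sample
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)  # column 0 is the sample itself
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_minority))        # pick a minority sample
        j = rng.choice(neighbor_idx[i][1:])      # pick one of its k nearest neighbors
        gap = rng.random()                       # random number in [0, 1)
        # take a random point on the line segment between the sample and its neighbor
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)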

Why SMOTE?

It proved to be better than other methods of handling imbalanced datasets: it significantly improved recall on the minority class and classified minority examples better than the alternatives. Decision trees built after SMOTE can achieve better results with fewer nodes, which indicates a more general boundary between classes. Over the years, it has been preferred over other methods.
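In practice, SMOTE is rarely coded by hand. Here is a minimal usage sketch, assuming the third-party imbalanced-learn library and an illustrative toy dataset built with scikit-learn:

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# toy imbalanced dataset: roughly 95% class 0 and 5% class 1
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=42)

smote = SMOTE(k_neighbors=5, random_state=42)   # five nearest neighbors, as in the paper
X_resampled, y_resampled = smote.fit_resample(X, y)
print(y.sum(), y_resampled.sum())               # minority count before vs. after over-sampling

After fit_resample, the minority class is over-sampled with synthetic points until the two classes are balanced (the library's default sampling strategy).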

When not to use SMOTE?

SMOTE is usually not preferred for multi-label classification problems unless the decision boundary is clear. It synthetically adds points along the line segments joining a sample to its k nearest neighbours. If data points of different classes are close to each other, those line segments can cross into a neighbouring class's region, so the synthetically generated samples may effectively belong to a different class and act as noise.

Conclusion

  • SMOTE is the preferred technique for binary classification on imbalanced data.
  • Ideally, you should collect more data for such business problems.
  • After applying SMOTE, also consider running anomaly detection, which finds rare items, events or observations that raise suspicion by differing significantly from the majority of the data. This helps remove such noise further.