[ Back to Basics - 01 ] Synthetic Minority Over-sampling Technique (SMOTE)

3 minute read


First article in 2019.

In this post I’m going to write about one of the most popular sampling techniques in machine learning - that is SMOTE. Hope I can explain the concept in simple language.

Note: I presume that the readers of this post are quite familiar with the concept of imbalanced data.

Sampling in machine learning is used to overcome the problem of imbalanced data. Specifically, sampling has two general categories, namely under-sampling and over-sampling. In under-sampling, we make the amount of majority class to be similar to the amount of minority class. The simplest approach to perform under-sampling would be by randomly selecting instances from the majority class. Nevertheless, the most obvious drawback from this approach is we might lose important or specific information which in fact build the pattern of the majority class.

On the other hand, in over-sampling we augment the number of minority instances with the result that its amount is quite similar to the amount of samples in the majority class. The simplest approach would be by replicating the samples in the minority class. Replicating the samples might result in overfitting.

Synthetic Minority Over-sampling Technique (SMOTE) is classified as over-sampling technique. It’s been one of the most popular over-sampling algorithms due to its implementation simplicity. Because of the same reason, many other over-sampling techniques are built based on this SMOTE algorithm. One of the examples is SMOTE-borderline. I’m gonna write about this SMOTE’s variant in the next article.

Now, let’s take a deep dive into SMOTE’s algorithm.

For each minority sample (ms), do the followings:

  1. Find k nearest minority samples. For instance, k = 3 would yield result as in the below figure. The figure shows that sample ms = A has sample B, C, D as the first, second, and third nearest sample respectively.

k-nearest neighbors of sample A

Fig. 1. k-nearest neighbors of sample A

  1. Based on the amount of oversampling N (e.g. N = 2 means that you want to synthesize 2 new samples), select N samples from the k samples randomly. For N = 2, the result is shown in the figure below. The selection of sample B and D was performed randomly.

Take N random samples from the k-nearest neighbors

Fig. 2. Take N random samples from the k-nearest neighbors

  1. For each sample in N (let’s say n), generate the new sample between ms and n using the following steps:

a. Compute the distance of feature vector between ms and n. Let’s store the result in diffFV(ms, n). The result of this step should be in the form of a feature vector. Based on the Fig. 2, the computation of feature vector distance is:

diffFV(A, B) = feature_vec(B) - feature_vec(A)
diffFV(A, D) = feature_vec(D) - feature_vec(A)
diffFV(A, B) = feature_vec(B) - feature_vec(A)
diffFV(A, D) = feature_vec(D) - feature_vec(A)

b. Multiply the diffFV(ms, n) with a random number between 0 and 1. This ensures that the new sample lies between ms and n. Let’s assume that the output of this step is diffFV_mul(ms, n)

diffFV_mul(A, B) = diffFV(A, B) * rand(0, 1)
diffFV_mul(A, D) = diffFV(A, D) * rand(0, 1)

c. Add diffFV_mul(ms, n) to ms

new_sample_between(A, B) = feature_vec(A) + diffFV_mul(A, B)
new_sample_between(A, D) = feature_vec(A) + diffFV_mul(A, D)

d. Done

Although this article only shows the high-level overview of SMOTE, I hope this brief review may help those who’ve been trying to grasp the concept of SMOTE.

Feel free to comment if you found any irrelevant or missing information.