SMOTE resampling
A more powerful approach has been proposed by Chawla et al. (in SMOTE: Synthetic Minority Over-sampling Technique, Chawla N. V., Bowyer K. W., Hall L. O., Kegelmeyer W. P., Journal of Artificial Intelligence Research, 16/2002). The algorithm is called Synthetic Minority Over-sampling Technique (SMOTE) and, unlike the previous one, it was designed to generate new samples that are coherent with the minority class distribution. A full description of the algorithm is beyond the scope of this book (it can be found in the aforementioned paper); however, the main idea is to consider the relationships that exist between samples and create new synthetic points along the segments connecting a group of neighbors. Let's consider the following diagram:
The three points (x1, x2, x3) belong to a minority class and are members of the same neighborhood (a reader who is not familiar with this concept can think of a group of points whose mutual distances are below a fixed threshold). SMOTE can upsample the class by generating the samples x1u and x2u and placing them on the segments connecting the original samples. This procedure can be better understood by assuming that the properties of the samples do not change within a certain neighborhood radius; hence, it's possible to create synthetic variants that belong to the same original distribution. However, contrary to resampling with replacement, the new dataset has a higher variance, and a generic classifier can better find a suitable separation hypersurface.
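The interpolation step described above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the imbalanced-learn implementation: the coordinates of x1 and x2, the helper name interpolate, and the seed are hypothetical, while the rule x_new = x_a + lam * (x_b - x_a), with lam drawn uniformly from [0, 1], follows the description in the original paper.

```python
import numpy as np


def interpolate(x_a, x_b, rng):
    """Return a synthetic point on the segment joining x_a and x_b."""
    lam = rng.uniform(0.0, 1.0)  # random position along the segment
    return x_a + lam * (x_b - x_a)


rng = np.random.default_rng(1000)

# Two hypothetical minority-class neighbors
x1 = np.array([1.0, 2.0])
x2 = np.array([2.0, 2.5])

# Synthetic sample lying between x1 and x2
x1u = interpolate(x1, x2, rng)
print(x1u)
```

Because lam is always between 0 and 1, the synthetic point can never fall outside the segment, which is why the new samples stay inside the region already occupied by the neighborhood.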
In order to show how SMOTE works, we are going to employ a scikit-learn extension called imbalanced-learn (see the box at the end of this section), which implements many algorithms to manage this kind of problem. The balanced dataset (based on the one we previously generated) can be obtained by using an instance of the SMOTE class:
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=1000)
X_resampled, Y_resampled = smote.fit_resample(X, Y)
The fit_resample method (named fit_sample in imbalanced-learn releases before 0.4) analyzes the original dataset and generates the synthetic samples from the minority class automatically. The most important parameters are as follows:
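- ratio (default is 'auto'; renamed sampling_strategy in imbalanced-learn 0.4 and later): It determines which classes must be resampled (acceptable values are 'minority', 'majority', 'all', and 'not minority'; the exact set of accepted strings depends on the installed release, so it's worth checking its documentation). The meaning of each alternative is intuitive, but in general, we work by upsampling the minority class or, less often, by resampling (and balancing) the whole dataset.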
- k_neighbors (default is 5): The number of neighbors to consider. Larger values yield denser resamplings; I invite the reader to repeat this process with k_neighbors equal to 2, 10, and 20, and compare the results. Remember that the underlying geometric structure is normally based on Euclidean distances, hence blobs are generally preferable to wireframe datasets. The value 5 is often a good trade-off between this condition and the freedom accorded to SMOTE in the generation process.
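To make the role of k_neighbors more concrete, the following is a simplified sketch of a SMOTE-like generator written with NumPy only. It is not the code used by imbalanced-learn: the function smote_sketch, the blob used as a minority class, and the seeds are all hypothetical, and the neighbor search is a brute-force Euclidean one that is reasonable only for small datasets.

```python
import numpy as np


def smote_sketch(X_min, n_new, k_neighbors=5, seed=1000):
    """Generate n_new synthetic points from the minority samples X_min."""
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances between the minority samples
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor

    # Indices of the k nearest neighbors of every sample
    nn = np.argsort(dist, axis=1)[:, :k_neighbors]

    # Pick a starting sample and one of its neighbors for each new point
    base = rng.integers(0, X_min.shape[0], size=n_new)
    picked = nn[base, rng.integers(0, k_neighbors, size=n_new)]

    # Place the new point at a random position along the segment
    lam = rng.uniform(0.0, 1.0, size=(n_new, 1))
    return X_min[base] + lam * (X_min[picked] - X_min[base])


# Hypothetical minority class: 30 points in a bidimensional blob
rng = np.random.default_rng(0)
X_min = rng.normal(loc=[0.0, 2.0], scale=0.5, size=(30, 2))

for k in (2, 10, 20):
    X_new = smote_sketch(X_min, n_new=200, k_neighbors=k)
    print(k, X_new.shape, X_new.std(axis=0).round(3))
```

With a small k, every synthetic point lies on a segment joining two very close samples; with a larger k, the segments can span a wider part of the class, so the choice affects how the generated points are spread. Note that k_neighbors must be smaller than the number of minority samples.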
We can better understand this behavior by observing the following graph (we have upsampled the minority class with 5 neighbors):
As we can see, the original dataset has only a few points belonging to class 2, and they are all in the upper part of the graph. Resampling with replacement can increase the number of samples, but the resulting graph would be exactly the same, since the values are always taken from the existing set. On the other hand, SMOTE has generated the same number of samples by considering the neighborhoods (in this case, there's also an overlap in the original dataset). The final result is clearly acceptable and consistent with the data generating process. Moreover, it can help a classifier find the optimal separating curve, which will probably be more centered (it could be a horizontal line passing through x1=0) than the one associated with the unbalanced dataset.
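The difference between the two strategies can also be checked numerically by counting the distinct points each one produces. The snippet below is a rough sketch under stated assumptions: the ten minority points are hypothetical, and the interpolation joins random pairs of samples rather than true nearest neighbors, so it only mimics the segment-based generation and is not SMOTE itself.

```python
import numpy as np

rng = np.random.default_rng(1000)

# Hypothetical minority class with 10 distinct points
X_min = rng.normal(size=(10, 2))
n = X_min.shape[0]

# Resampling with replacement: 100 rows, all copies of existing points
idx = rng.integers(0, n, size=100)
X_boot = X_min[idx]

# Interpolation between pairs of different samples: 100 new rows
a = rng.integers(0, n, size=100)
b = (a + rng.integers(1, n, size=100)) % n  # guarantees b != a
lam = rng.uniform(size=(100, 1))
X_interp = X_min[a] + lam * (X_min[b] - X_min[a])

# Number of distinct points obtained with each strategy
n_boot = np.unique(X_boot, axis=0).shape[0]
n_interp = np.unique(X_interp, axis=0).shape[0]
print(n_boot, n_interp)
```

The first count can never exceed the size of the original minority set, which is why a scatter plot of the resampled data looks unchanged, while the interpolated set is made up almost entirely of points that did not exist before.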