SMOTE for Imbalanced Classification with Python

When a dataset has more samples of one class and very few of another, the model tends to predict the majority class more often. This problem is called class imbalance. The Synthetic Minority Over-sampling Technique (SMOTE) helps fix this issue by creating new synthetic samples for the smaller (minority) class. This makes the dataset more balanced and helps the model learn both classes properly.

SMOTE creates new synthetic data instead of copying existing samples.
It improves accuracy for the minority class.
Variants like ADASYN, Borderline SMOTE, SMOTE-ENN and SMOTE-TOMEK make SMOTE even more effective.
It can be easily used with the Python library imbalanced-learn (imblearn).

Synthetic Minority Over-Sampling Technique (SMOTE)

SMOTE is a data-level resampling technique that generates synthetics (artificial) samples for the minority class. Instead of simply duplicating existing examples, it creates new data points by interpolating between existing ones. This approach allows the model to learn broader patterns and reduces the risk of overfiting to repeated samples.

Working:

Identify the Minority Class: The process begins by detecting which class (or classes) have significantly fewer samples compared to others
Find Nearest Neighbors: For each sample in the minority class, SMOTE locates its k nearest neighbors (based on distance in the feature space). The value of k is a user-defined parameter that controls how many neighbors are considered.
Generate Synthetic Samples: A random neighbor from these k nearest points is chosen and a new synthetic instance is created along the line segment connecting the original sample the the chosen neighbor. This ensures the new points are realistic yet distinct.
Control Oversampling Amount: The number of synthetic samples to be generated is determined by an oversampling ratio, which is chosen so that both classes reach a similar size or desired balance.
Repeat for All Minority Samples: Steps 2 to 4 are repeated for all minority class examples to produce enough synthetic data for balancing
Form the Final Balanced Dataset: After generating these synthetic examples, the dataset becomes more balanced, helping machine learning models train more effectively and fairly across all classes

Step By Step Implementation

Here in this code we handles class imbalance in a credit card fraud dataset by applying SMOTE oversampling trains a logistic regression model and evaluates its performance using accuracy, classification report and confusion matrix.

Step 1: Import Required Libraries

Import Pandas for handling CSV files and working with DataFrames.
Import Matplotlib for basic data visualization and plotting graphs.
Import SMOTE to handle class imbalance by oversampling the minority class.
Import Seaborn for enhanced statistical data visualiation

python

Step 2: Load and Explore the Dataset

Load the credit card fraud dataset from CSV
Separate features (X) and target variable (y)
Handle missing values using median imputation
Visualize the original class imbalance

python

Output:

Figure 1. Original Class Distribution in dataset

Step 3: Train Logistic Regression

Split data into training and testing sets using stratification
Train a Logistic Regression model on imbalanced data
Predict test results and evaluate performance

Step 4: Apply SMOTE to Balance Classes

Apply SMOTE to oversample the minority (fraud) class
Generate a balanced training dataset
Visualize the new class distribution after SMOTE.

python

Output:

Figure 3. Class Distribution AFTER SMOTE

Step 5: Train Logistic Regression After SMOTE

Retrain the model using the balanced dataset
Predict results on the same test set
Compare accuracy and classification performance

python

Output:

After applying SMOTE the model becomes better at detecting the minority class but at the cost of increased false positives highlighting the trade-off between recall and precision in imbalanced datasets.

After SMOTE the model no longer favors the majority class so overall accuracy decreases but this reflects a more balanced learning process.
Recall improves but precision collapses for minority class Class 1 recall rises sharply meaning most minority samples are detected but very low precision shows many false positives.
Confusion matrix shows over-prediction of minority class A large number of majority samples are misclassified as class 1 indicating the model became too sensitive after SMOTE.

Variants of SMOTE

SMOTE effectively addresses data imbalance by generating synthetic samples, enriching the minority class and refining decision boundaries. Despite its benefits, SMOTE's computational demands can escalate with larger datasets and high-dimensional feature spaces. To enhance SMOTE's capability to handle various data scenarios, several extensions have been developed:

1. ADASYN (Adaptive Synthetic Sampling)

ADASYN stands for Adaptive Synthetic Sampling. It is an improved version of SMOTE that automatically focuses more on minority samples that are difficult to learn. Instead of generating synthetic samples uniformly, ADASYN creates more new samples for minority points that are near the decision boundary where the model usually makes more mistakes.

Working:

Calculate the level of difficulty for each minority sample. Here samples surrounded by majority samples are considered harder to learn.
Assign higher weights to difficult samples so that more synthetic examples are created around them.
Generate synthetic samples by interpolating between each difficult sample and its nearest minority neighbors.
The final dataset has more new samples near the boundary, improving the model’s ability to classify challenging regions.

Implementation:

python

Output:

2. Borderline SMOTE

Borderline SMOTE is a modified version of SMOTE that focuses only on minority samples that lie near the boundary between classes. These are the samples most likely to be misclassified, so generating synthetic samples around them helps strengthen the classifier’s performance near decision boundaries.

Working:

Identify minority samples that have many majority samples as their nearest neighbors and these are called borderline samples.
Generate synthetic samples only around these borderline points, avoiding areas deep inside the majority class.
This keeps the generated data clean and helps the model learn class boundaries more accurately.

Implementation:

python

Output:

3. SMOTE-ENN (Edited Nearest Neighbors)

SMOTE-ENN combines two techniques i.e SMOTE for oversampling and Edited Nearest Neighbors (ENN) for cleaning. First, SMOTE generates synthetic data to balance the dataset. Then, ENN removes noisy or misclassified points from both classes to make the dataset cleaner and more reliable.

Working:

Apply SMOTE to oversample the minority class.
For each sample, look at its nearest neighbors.
If most neighbors belong to a different class, remove that sample beacuse it’s likely a noise.
The result is a balanced and denoised dataset that helps improve model performance and generalization.

Implementation:

python

Output:

4. SMOTE-TOMEK (Hybrid Method)

SMOTE-TOMEK is a hybrid resampling technique that combines SMOTE and Tomek Links. After oversampling with SMOTE, Tomek Links are identified and removed to eliminate overlapping or borderline points between classes.

Working:

SMOTE first oversamples the minority class by creating synthetic examples.
Find Tomek Links as these are pairs of samples from opposite classes that are each other’s nearest neighbors.
Remove those pairs, as they often lie in overlapping regions that confuse the model.
The final dataset becomes balanced and cleaner, improving the separation between classes.

Implementation:

python

5. SMOTE-NC (Nominal Continuous)

SMOTE-NC (Synthetic Minority Over-sampling Technique for Nominal and Continuous features) is a version of SMOTE designed for datasets that contain both numerical and categorical variables. Traditional SMOTE works by interpolating between numeric features, but it fails when applied directly to categorical data because we can’t interpolate between category labels. SMOTE-NC fixes this by treating categorical and continuous features differently during the generation of synthetic samples.

Working:

Identify which features in the dataset are categorical and which are continuous.
For continuous features, generate synthetic samples using interpolation just like standard SMOTE.
For categorical features, assign the most frequent category among the nearest neighbors instead of interpolating.
Combine the newly generated synthetic samples to create a balanced dataset that respects both numeric and categorical data integrity.

python

When to Use Each SMOTE Variant

Let's discuss when to use each variant:

SMOTE	Best use case	Main strength	When to use/ key notes
SMOTE (Standard)	Moderately imbalanced continuous datasets	Balances classes using synthetic interpolated samples.	Use for numeric, low-noise datasets works as a general solution
ADASYN (Adaptive SMOTE)	Datasets with region-wise imbalance	Generates adaptive synthetic data for hard-to-learn samples	Use for hard to classify minority regions to improve boundary learning
Borderline SMOTE	Minority samples close to class boundaries.	Generates samples near decision boundaries to reduce misclassification	Use when classes overlap or boundaries are frequently confused
SMOTE-ENN (Hybrid)	Noisy datasets with misclassified or ambiguous samples	Combines SMOTE and ENN to oversample and clean noisy instances	Use when the dataset has noise or outliers and we want a cleaner, balanced dataset.
SMOTE-TOMEK (Hybrid)	Datasets with overlapping classes needing clearer separation	Removes Tomek links post-SMOTE to reduce overlap and improve separation	Use when we want to improve boundary clarity after oversampling.
SMOTE-NC (Nominal Continuous)	Datasets with both categorical and continuous features.	Handles mixed features via numeric interpolation and categorical mode assignment	Use for datasets with categorical columns not suited for purely numeric data