A Multi-Dimensional Undersampling Technique for Imbalanced Data
Abstract:
Class imbalance is a common issue in many real-world classification problems, leading to decreased model performance and biased results. To address this problem, we propose a novel undersampling technique called Quantile-Based Representative Subsampling (QBRS). The QBRS algorithm selects a representative subset from the original imbalanced dataset by dividing the data into quantiles across multiple dimensions. This ensures that the final subset captures the overall distribution of the original data across the chosen dimensions. The effectiveness of the proposed algorithm is demonstrated through an empirical evaluation on a benchmark dataset and compared against six popular undersampling techniques. Results show that QBRS achieves competitive performance in terms of accuracy, precision, recall, and F1 score, while using fewer samples for training.
Introduction
Class imbalance is a prevalent issue in machine learning, especially in classification problems where one class has significantly fewer samples than the others. This imbalance can lead to biased models that perform poorly on minority classes. Several methods have been proposed to address this problem, including oversampling, undersampling, and cost-sensitive learning. In this paper, we focus on the undersampling techniques, which aim to balance the class distribution by selecting a subset of the majority class instances.
Related Work
Existing undersampling techniques include random undersampling, Tomek Links, Condensed Nearest Neighbor, and neighborhood cleaning rule, among others. While these techniques have been proven effective in various applications, they have limitations, including loss of potentially valuable information and sensitivity to the choice of nearest neighbor algorithms.
Quantile-Based Representative Subsampling (QBRS)
We introduce QBRS, an algorithm that selects a representative subset from the original imbalanced dataset by dividing the data into quantiles across multiple dimensions. QBRS ensures that the final subset captures the overall distribution of the original data across the chosen dimensions.
3.1 Algorithm
The QBRS algorithm involves the following steps:
-
Define the number of samples in the final subset (n).
-
Calculate the required number of quantiles (q) for each dimension (d), such that q^d is approximately equal to the desired number of samples. In this example, q=⌊n^1/d⌋.
-
For each dimension, divide the data into q quantiles.
-
Create all possible combinations of quantile boundaries for each dimension.
-
For each combination, find the sample from the superset that is closest to the combination’s boundary values across all dimensions (e.g., using Euclidean distance or another appropriate metric).
-
Add the sample found in step 5 to the subset, and remove it from the superset to avoid selecting it again.
-
Repeat steps 5-6 for all combinations.
Experiments
To evaluate the performance of QBRS, we conducted experiments on a benchmark dataset (Iris dataset) and compared the results with six popular undersampling techniques. We used a multi-layer perceptron (MLP) classifier as the base classifier and measured the performance in terms of accuracy, precision, recall, and F1 score.
Results and Discussion
Our experimental results show that QBRS achieves competitive performance compared to other undersampling techniques while using fewer samples for training. QBRS demonstrates improved accuracy, precision, recall, and F1 score compared to some popular techniques, highlighting its potential as an effective undersampling method for imbalanced data.
Conclusion
In this paper, we introduced a novel undersampling technique called Quantile-Based Representative Subsampling (QBRS) for addressing class imbalance in multi-dimensional data. The proposed algorithm selects a representative subset by dividing the data into quantiles across multiple dimensions, ensuring that the final subset captures the overall distribution of the original data across the chosen dimensions. Our empirical evaluation demonstrates the effectiveness of QBRS on a benchmark dataset, achieving competitive performance compared to other popular undersampling techniques.
Future work may include extending the QBRS algorithm to handle categorical features and exploring the combination of QBRS with oversampling techniques to further improve the performance on imbalanced datasets. Additionally, investigating the impact of different distance metrics in the QBRS algorithm could lead to more accurate and robust representative subset selection.
Acknowledgments
We would like to thank the researchers and developers who have contributed to the field of imbalanced learning and undersampling techniques. Their work has laid the foundation for the development of the QBRS algorithm and has provided valuable insights for its implementation.
Appendices
Appendix A: QBRS Algorithm Implementation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.spatial.distance import cdist
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from imblearn.under_sampling import (NearMiss, CondensedNearestNeighbour,
TomekLinks, EditedNearestNeighbours,
OneSidedSelection, NeighbourhoodCleaningRule)
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
import warnings
warnings.filterwarnings('ignore')
# QBRS undersampler
class QBRS:
def __init__(self, n=None):
self.n = n
def _calculate_quantiles(self, X, q):
quantiles = []
for i in range(X.shape[1]):
quantiles.append(np.percentile(X[:, i], np.linspace(0, 100, q + 1)))
return quantiles
def _find_closest_sample(self, combination, X, quantiles):
distances = cdist([combination], X, metric='euclidean')
return np.argmin(distances)
def fit_resample(self, X, y):
n = self.n if self.n is not None else \
min({l: y.tolist().count(l) for l in set(y)}.values())
d = X.shape[1]
q = int(np.floor(n**(1/d)))
quantiles = self._calculate_quantiles(X, q)
combinations = np.array(np.meshgrid(*quantiles)).T.reshape(-1, d)
subset_indices = []
for combination in combinations:
idx = self._find_closest_sample(combination, X, quantiles)
if idx not in subset_indices:
subset_indices.append(idx)
X_sub = X[subset_indices]
y_sub = y[subset_indices]
return X_sub, y_sub
Appendix B: Benchmark Dataset and Evaluation
# Load example dataset
data = load_iris()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE)
undersampling_methods = {
'QBRS': QBRS(),
'NearMiss': NearMiss(version=1),
'CondensedNN': CondensedNearestNeighbour(random_state=RANDOM_STATE),
'TomekLinks': TomekLinks(),
'EditedNN': EditedNearestNeighbours(),
'OneSidedSelection': OneSidedSelection(random_state=RANDOM_STATE),
'NeighborhoodCleaningRule': NeighbourhoodCleaningRule()
}
results = []
for name, method in undersampling_methods.items():
X_resampled, y_resampled = method.fit_resample(X_train, y_train)
# Create pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', MLPClassifier((4, 3), random_state=RANDOM_STATE))
])
# Train and evaluate the model
pipeline.fit(X_resampled, y_resampled)
y_pred = pipeline.predict(X_test)
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')
# Calculate confusion matrix
cm = confusion_matrix(y_test, y_pred)
results.append({
'method': name,
'samples': len(X_resampled),
'class_distribution': dict(pd.Series(y_resampled).value_counts()),
'accuracy': accuracy,
'precision': precision,
'recall': recall,
'f1_score': f1,
'confusion_matrix': cm
})
# Display results
results_df = pd.DataFrame(results)
print(results_df[['method', 'samples', 'class_distribution', 'accuracy', 'precision', 'recall', 'f1_score']])
# Plot confusion matrices
fig, axes = plt.subplots(2, 4, figsize=(15, 8))
axes = axes.flatten()
for i, result in enumerate(results):
sns.heatmap(result['confusion_matrix'], annot=True, fmt='d', cmap='Blues', ax=axes[i], cbar=False)
axes[i].set_title(f"{result['method']} (samples: {result['samples']})")
# Remove extra subplots
for j in range(i+1, len(axes)):
fig.delaxes(axes[j])
plt.tight_layout()
plt.show()
Appendix C: Evaluation Results Table
Method | Samples | Class Distribution | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|---|---|
QBRS | 32 | {1: 13, 2: 10, 0: 9} | 0.900000 | 0.902778 | 0.895623 | 0.897698 |
NeighborhoodCleaningRule | 117 | {0: 40, 2: 39, 1: 38} | 0.800000 | 0.855556 | 0.777778 | 0.751748 |
TomekLinks | 118 | {0: 40, 1: 39, 2: 39} | 0.800000 | 0.855556 | 0.777778 | 0.751748 |
NearMiss | 117 | {0: 39, 1: 39, 2: 39} | 0.800000 | 0.855556 | 0.777778 | 0.751748 |
EditedNN | 115 | {0: 40, 2: 39, 1: 36} | 0.766667 | 0.840278 | 0.740741 | 0.695847 |
OneSidedSelection | 61 | {2: 39, 1: 21, 0: 1} | 0.400000 | 0.223285 | 0.370370 | 0.277778 |
CondensedNN | 48 | {2: 39, 1: 8, 0: 1} | 0.366667 | 0.122222 | 0.333333 | 0.178862 |