Quantile-Based Representative Subsampling

March 16, 2023
6 minutes

A Multi-Dimensional Undersampling Technique for Imbalanced Data

[Figure: confusion matrices comparing the sampling methods]

Abstract:

Class imbalance is a common issue in many real-world classification problems, leading to decreased model performance and biased results. To address this problem, we propose a novel undersampling technique called Quantile-Based Representative Subsampling (QBRS). The QBRS algorithm selects a representative subset from the original imbalanced dataset by dividing the data into quantiles across multiple dimensions. This ensures that the final subset captures the overall distribution of the original data across the chosen dimensions. The proposed algorithm is evaluated empirically on a benchmark dataset and compared against six popular undersampling techniques. Results show that QBRS achieves competitive performance in terms of accuracy, precision, recall, and F1 score, while using fewer samples for training.

Introduction

Class imbalance is a prevalent issue in machine learning, especially in classification problems where one class has significantly fewer samples than the others. This imbalance can lead to biased models that perform poorly on minority classes. Several methods have been proposed to address this problem, including oversampling, undersampling, and cost-sensitive learning. In this paper, we focus on undersampling techniques, which aim to balance the class distribution by selecting a subset of the majority class instances.

Existing undersampling techniques include random undersampling, Tomek Links, Condensed Nearest Neighbor, and the neighborhood cleaning rule, among others. While these techniques have proven effective in various applications, they have limitations, including the loss of potentially valuable information and sensitivity to the choice of nearest-neighbor algorithm.

Quantile-Based Representative Subsampling (QBRS)

We introduce QBRS, an algorithm that selects a representative subset from the original imbalanced dataset by dividing the data into quantiles across multiple dimensions. QBRS ensures that the final subset captures the overall distribution of the original data across the chosen dimensions.

Algorithm

The QBRS algorithm involves the following steps:

  1. Define the number of samples in the final subset (n).

  2. Calculate the required number of quantiles (q) for each dimension (d) such that q^d is approximately equal to the desired number of samples; specifically, q = ⌊n^(1/d)⌋.

  3. For each dimension, divide the data into q quantiles.

  4. Create all possible combinations of quantile boundaries for each dimension.

  5. For each combination, find the sample from the superset that is closest to the combination’s boundary values across all dimensions (e.g., using Euclidean distance or another appropriate metric).

  6. Add the sample found in step 5 to the subset, and remove it from the superset to avoid selecting it again.

  7. Repeat steps 5-6 for all combinations; a short code sketch of steps 2-6 follows this list.
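
To make the arithmetic concrete, the following sketch walks through steps 2-6 on synthetic data. The array names and sizes are illustrative only; like the reference implementation in Appendix A, grid points whose nearest sample has already been selected are simply skipped, so the final subset contains at most (q+1)^d samples.

import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))     # hypothetical 4-dimensional superset
n = 40                            # desired subset size (illustrative)

# Step 2: number of quantiles per dimension, q = floor(n^(1/d))
d = X.shape[1]
q = int(np.floor(n ** (1 / d)))   # q = 2 for n = 40, d = 4

# Step 3: q quantiles per dimension yield q + 1 boundary values each
quantiles = [np.percentile(X[:, i], np.linspace(0, 100, q + 1)) for i in range(d)]

# Step 4: Cartesian product of the boundary values -> (q + 1)^d = 81 grid points
grid = np.array(np.meshgrid(*quantiles)).T.reshape(-1, d)

# Steps 5-6: for each grid point, keep the nearest (Euclidean) sample,
# skipping samples already selected for an earlier grid point
selected = []
for point in grid:
    idx = int(np.argmin(cdist([point], X)))
    if idx not in selected:
        selected.append(idx)

print(q, grid.shape, len(selected))   # final subset size is at most (q + 1)^d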

Experiments

To evaluate the performance of QBRS, we conducted experiments on a benchmark dataset (Iris dataset) and compared the results with six popular undersampling techniques. We used a multi-layer perceptron (MLP) classifier as the base classifier and measured the performance in terms of accuracy, precision, recall, and F1 score.
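
In practice, QBRS drops into the same fit_resample workflow as the imbalanced-learn samplers it is compared against. The snippet below is a condensed, illustrative view of that setup, assuming the QBRS class from Appendix A is in scope; the complete experimental code is given in Appendices A and B.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# QBRS (Appendix A) exposes the same fit_resample interface as the
# imbalanced-learn undersamplers; n defaults to the smallest class count.
sampler = QBRS()
X_sub, y_sub = sampler.fit_resample(X_train, y_train)
print(X_sub.shape, np.bincount(y_sub))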

Results and Discussion

Our experimental results show that QBRS achieves competitive performance compared to the other undersampling techniques while using fewer samples for training. On the Iris benchmark (Appendix C), QBRS attains higher accuracy, precision, recall, and F1 score than each of the six compared techniques while training on the smallest subset (32 samples), highlighting its potential as an effective undersampling method for imbalanced data.

Conclusion

In this paper, we introduced a novel undersampling technique called Quantile-Based Representative Subsampling (QBRS) for addressing class imbalance in multi-dimensional data. The proposed algorithm selects a representative subset by dividing the data into quantiles across multiple dimensions, ensuring that the final subset captures the overall distribution of the original data across the chosen dimensions. Our empirical evaluation demonstrates the effectiveness of QBRS on a benchmark dataset, achieving competitive performance compared to other popular undersampling techniques.

Future work may include extending the QBRS algorithm to handle categorical features and exploring the combination of QBRS with oversampling techniques to further improve the performance on imbalanced datasets. Additionally, investigating the impact of different distance metrics in the QBRS algorithm could lead to more accurate and robust representative subset selection.

Acknowledgments

We would like to thank the researchers and developers who have contributed to the field of imbalanced learning and undersampling techniques. Their work has laid the foundation for the development of the QBRS algorithm and has provided valuable insights for its implementation.

Appendices

Appendix A: QBRS Algorithm Implementation

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.spatial.distance import cdist
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from imblearn.under_sampling import (NearMiss, CondensedNearestNeighbour,
                                     TomekLinks, EditedNearestNeighbours,
                                     OneSidedSelection, NeighbourhoodCleaningRule)

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Suppress warnings (e.g., MLP convergence warnings) for cleaner output
import warnings
warnings.filterwarnings('ignore')

# QBRS undersampler
class QBRS:
    """Quantile-Based Representative Subsampling."""

    def __init__(self, n=None):
        # Desired subset size; defaults to the size of the smallest class.
        self.n = n

    def _calculate_quantiles(self, X, q):
        # For each feature, compute the q + 1 quantile boundaries (0%..100%).
        quantiles = []
        for i in range(X.shape[1]):
            quantiles.append(np.percentile(X[:, i], np.linspace(0, 100, q + 1)))
        return quantiles

    def _find_closest_sample(self, combination, X):
        # Index of the sample nearest (Euclidean) to a grid point of boundary values.
        distances = cdist([combination], X, metric='euclidean')
        return np.argmin(distances)

    def fit_resample(self, X, y):
        # Target subset size: user-supplied n, or the smallest class count.
        n = self.n if self.n is not None else \
            min(y.tolist().count(label) for label in set(y))
        d = X.shape[1]
        q = int(np.floor(n ** (1 / d)))          # q = floor(n^(1/d))
        quantiles = self._calculate_quantiles(X, q)

        # Cartesian product of the per-dimension boundaries: (q + 1)^d grid points.
        combinations = np.array(np.meshgrid(*quantiles)).T.reshape(-1, d)
        subset_indices = []

        for combination in combinations:
            idx = self._find_closest_sample(combination, X)
            # Skip grid points whose nearest sample was already selected.
            if idx not in subset_indices:
                subset_indices.append(idx)

        X_sub = X[subset_indices]
        y_sub = y[subset_indices]

        return X_sub, y_sub

Appendix B: Benchmark Dataset and Evaluation


# Load example dataset
data = load_iris()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE)

undersampling_methods = {
    'QBRS': QBRS(),
    'NearMiss': NearMiss(version=1),
    'CondensedNN': CondensedNearestNeighbour(random_state=RANDOM_STATE),
    'TomekLinks': TomekLinks(),
    'EditedNN': EditedNearestNeighbours(),
    'OneSidedSelection': OneSidedSelection(random_state=RANDOM_STATE),
    'NeighborhoodCleaningRule': NeighbourhoodCleaningRule()
}

results = []

for name, method in undersampling_methods.items():
    X_resampled, y_resampled = method.fit_resample(X_train, y_train)
    
    # Standardize features, then train a small MLP on the resampled subset
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', MLPClassifier(hidden_layer_sizes=(4, 3), random_state=RANDOM_STATE))
    ])

    # Train and evaluate the model
    pipeline.fit(X_resampled, y_resampled)
    y_pred = pipeline.predict(X_test)
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='macro')
    recall = recall_score(y_test, y_pred, average='macro')
    f1 = f1_score(y_test, y_pred, average='macro')
    
    # Calculate confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    
    results.append({
        'method': name,
        'samples': len(X_resampled),
        'class_distribution': dict(pd.Series(y_resampled).value_counts()),
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'confusion_matrix': cm
    })

# Display results
results_df = pd.DataFrame(results)
print(results_df[['method', 'samples', 'class_distribution', 'accuracy', 'precision', 'recall', 'f1_score']])

# Plot confusion matrices
fig, axes = plt.subplots(2, 4, figsize=(15, 8))
axes = axes.flatten()

for i, result in enumerate(results):
    sns.heatmap(result['confusion_matrix'], annot=True, fmt='d', cmap='Blues', ax=axes[i], cbar=False)
    axes[i].set_title(f"{result['method']} (samples: {result['samples']})")

# Remove extra subplots
for j in range(i+1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()

Appendix C: Evaluation Results Table

Method                     Samples  Class Distribution     Accuracy  Precision  Recall    F1 Score
QBRS                            32  {1: 13, 2: 10, 0: 9}   0.900000  0.902778   0.895623  0.897698
NeighborhoodCleaningRule       117  {0: 40, 2: 39, 1: 38}  0.800000  0.855556   0.777778  0.751748
TomekLinks                     118  {0: 40, 1: 39, 2: 39}  0.800000  0.855556   0.777778  0.751748
NearMiss                       117  {0: 39, 1: 39, 2: 39}  0.800000  0.855556   0.777778  0.751748
EditedNN                       115  {0: 40, 2: 39, 1: 36}  0.766667  0.840278   0.740741  0.695847
OneSidedSelection               61  {2: 39, 1: 21, 0: 1}   0.400000  0.223285   0.370370  0.277778
CondensedNN                     48  {2: 39, 1: 8, 0: 1}    0.366667  0.122222   0.333333  0.178862