Statistical Undersampling with Mutual Information and Support Points
Alex Mak, Shubham Sahoo, Shivani Pandey, Yidan Yue, Linglong Kong
TL;DR
Class imbalance often biases classifiers toward the majority class. The paper introduces two undersampling strategies, mutual information (MI)-based stratified simple random sampling and a support points optimization method, that aim to preserve information and distributional structure in the majority class. The MI approach uses pairwise mutual information to form strata and guide stratified sampling, while the support points method minimizes the energy distance to maintain fidelity to the original distribution. Empirical results on a small breast cancer dataset and a large credit-card fraud dataset show that MI-based undersampling yields substantial gains over naive downsampling in small-data settings, and that support points achieve strong representativeness with competitive performance on large-scale data. Together, these results underscore the value of integrating statistical sampling principles with machine learning for imbalanced classification.
Abstract
Class imbalance and distributional differences in large datasets present significant challenges for classification tasks in machine learning, often leading to biased models and poor predictive performance for minority classes. This work introduces two novel undersampling approaches: mutual information-based stratified simple random sampling and support points optimization. These methods prioritize representative data selection, with the aim of minimizing information loss. Empirical results across multiple classification tasks demonstrate that our methods outperform traditional undersampling techniques, achieving higher balanced classification accuracy. These findings highlight the potential of combining statistical concepts with machine learning to address class imbalance in practical applications.
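To make the representativeness criterion behind support points concrete, the following is a minimal sketch of the sample energy distance between a majority-class dataset and a candidate subsample. It is not the authors' implementation: the function names, the synthetic Gaussian data, and the subset sizes are all assumptions chosen for illustration, and the sketch only evaluates the distance rather than optimizing over it.

```python
import numpy as np


def mean_pairwise_distance(a, b):
    """Mean Euclidean distance over all pairs (a_i, b_j)."""
    diffs = a[:, None, :] - b[None, :, :]          # shape (n, m, d) via broadcasting
    return np.sqrt((diffs ** 2).sum(axis=-1)).mean()


def energy_distance(x, y):
    """Sample (V-statistic) energy distance between two point sets.

    2 * E||X - Y|| - E||X - X'|| - E||Y - Y'||, with expectations replaced
    by averages over all pairs. It is non-negative, and zero when the two
    point sets are identical.
    """
    return (
        2.0 * mean_pairwise_distance(x, y)
        - mean_pairwise_distance(x, x)
        - mean_pairwise_distance(y, y)
    )


# Hypothetical majority-class data: 500 points in 3 dimensions (assumption).
rng = np.random.default_rng(0)
majority = rng.normal(size=(500, 3))

# A naive random subsample, and a deliberately shifted copy of it that
# no longer resembles the original distribution.
random_subset = majority[rng.choice(len(majority), size=50, replace=False)]
shifted_subset = random_subset + 2.0

d_random = energy_distance(majority, random_subset)
d_shifted = energy_distance(majority, shifted_subset)
```

Under these assumptions, a subsample that follows the original distribution gives a smaller energy distance than a distorted one, so `d_random` is smaller than `d_shifted`. A support points procedure would go further and search for the subset that makes this quantity as small as possible.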
