Balancing the Scales: A Comprehensive Study on Tackling Class Imbalance in Binary Classification

Mohamed Abdelhamid; Abhyuday Desai

TL;DR

This study tackles the persistent challenge of class imbalance in binary classification by empirically comparing three prevalent strategies (SMOTE, Class Weights, and Decision Threshold Calibration) against a Baseline across 30 diverse datasets and 15 models, using nested 5-fold cross-validation. The primary metric is the $F_1$-score, supplemented by nine other metrics that capture calibration and minority-class performance. Results show that all three imbalance-handling methods outperform the Baseline, with Decision Threshold Calibration providing the most consistent gains, though improvements vary across datasets and models. Notably, SMOTE improves minority detection but can degrade probability calibration, while Threshold Calibration maintains calibration and often yields higher $F_1$ and $F_2$ scores. The findings advocate dataset- and model-aware testing of imbalance strategies: Decision Threshold Calibration is a practical default, but exploring multiple approaches remains valuable for specific problems.

Abstract

Class imbalance in binary classification tasks remains a significant challenge in machine learning, often resulting in poor performance on minority classes. This study comprehensively evaluates three widely used strategies for handling class imbalance: Synthetic Minority Over-sampling Technique (SMOTE), Class Weights tuning, and Decision Threshold Calibration. We compare these methods against a baseline scenario of no intervention across 15 diverse machine learning models and 30 datasets from various domains, conducting a total of 9,000 experiments. Performance was primarily assessed using the F1-score, although our study also tracked results on nine additional metrics, including F2-score, precision, recall, Brier score, PR-AUC, and AUC. Our results indicate that all three strategies generally outperform the baseline, with Decision Threshold Calibration emerging as the most consistently effective technique. However, we observed substantial variability in the best-performing method across datasets, highlighting the importance of testing multiple approaches for specific problems. This study provides valuable insights for practitioners dealing with imbalanced datasets and emphasizes the need for dataset-specific analysis in evaluating class imbalance handling techniques.
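
To make the idea of Decision Threshold Calibration concrete, the sketch below picks the probability cutoff that maximizes the F1-score on held-out predictions instead of using the default 0.5. It is an illustration under stated assumptions, not the procedure used in the study: the function names `f1_at_threshold` and `calibrate_threshold`, the grid of candidate thresholds, and the toy validation data are all hypothetical.

```python
import numpy as np


def f1_at_threshold(y_true, y_prob, threshold):
    """F1 for the positive (minority) class when predicting 1 if prob >= threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    denom = 2 * tp + fp + fn
    # Convention: F1 is 0 when there are no positives predicted or present.
    return 2 * tp / denom if denom else 0.0


def calibrate_threshold(y_true, y_prob, grid=None):
    """Return (best_threshold, best_f1) over a grid of candidate cutoffs.

    The grid is an assumption made for this sketch; ties go to the
    lowest threshold because np.argmax returns the first maximum.
    """
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)  # 0.05, 0.10, ..., 0.95
    scores = [f1_at_threshold(y_true, y_prob, t) for t in grid]
    best = int(np.argmax(scores))
    return float(grid[best]), scores[best]


# Hypothetical validation set: 8 negatives, 2 positives (20% minority class).
y_val = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
p_val = np.array([0.03, 0.08, 0.12, 0.13, 0.17, 0.22, 0.27, 0.47, 0.32, 0.42])

baseline_f1 = f1_at_threshold(y_val, p_val, 0.5)        # default cutoff
best_threshold, best_f1 = calibrate_threshold(y_val, p_val)
print(f"F1 at 0.5: {baseline_f1:.2f}; best threshold: {best_threshold:.2f}, F1: {best_f1:.2f}")
```

On this toy data no score reaches 0.5, so the default cutoff predicts no positives at all, while a lower cutoff recovers both minority examples at the cost of one false positive. The threshold should be chosen on data not used for the final evaluation, which is one reason nested cross-validation is a natural fit for this kind of comparison.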

Paper Structure (21 sections, 4 tables)

This paper contains 21 sections, 4 tables.

Introduction
Related Works
Oversampling Techniques
Decision Threshold
Class Weights
Gaps in the Literature and Motivation for This Study
Methodology
Datasets
Models
Evaluation Metrics
Experimental Procedure
Scenario Descriptions
Hyperparameter Tuning
Results
Overall Comparison
...and 6 more sections
