Balancing the Scales: A Comprehensive Study on Tackling Class Imbalance in Binary Classification

Mohamed Abdelhamid, Abhyuday Desai

TL;DR

This study addresses class imbalance in binary classification by empirically comparing three common strategies (SMOTE, Class Weights, and Decision Threshold Calibration) against a Baseline with no intervention, across 30 diverse datasets and 15 models, using nested 5-fold cross-validation. The primary metric is the $F_1$-score, supplemented by nine other metrics that capture calibration and minority-class performance. All three imbalance-handling methods outperform the Baseline, and Decision Threshold Calibration provides the most consistent gains, though the size of the improvement varies across datasets and models. Notably, SMOTE improves minority-class detection but can degrade probability calibration, whereas Threshold Calibration leaves calibration intact and often yields higher $F_1$ and $F_2$ scores. The findings support dataset- and model-aware testing of imbalance strategies: Decision Threshold Calibration is a practical default, but it remains worthwhile to compare several approaches on any specific problem.

Abstract

Class imbalance in binary classification tasks remains a significant challenge in machine learning, often resulting in poor performance on the minority class. This study comprehensively evaluates three widely used strategies for handling class imbalance: Synthetic Minority Over-sampling Technique (SMOTE), Class Weights tuning, and Decision Threshold Calibration. We compare these methods against a baseline scenario of no intervention across 15 diverse machine learning models and 30 datasets from various domains, conducting a total of 9,000 experiments. Performance was primarily assessed using the F1-score, although our study also tracked nine additional metrics, including F2-score, precision, recall, Brier score, PR-AUC, and AUC. Our results indicate that all three strategies generally outperform the baseline, with Decision Threshold Calibration emerging as the most consistently effective technique. However, we observed substantial variability in the best-performing method across datasets, highlighting the importance of testing multiple approaches for specific problems. This study provides valuable insights for practitioners dealing with imbalanced datasets and emphasizes the need for dataset-specific analysis when evaluating techniques for handling class imbalance.
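To make the idea of Decision Threshold Calibration concrete, the sketch below picks the probability cutoff that maximizes the F-beta score instead of using the default 0.5. It is a minimal illustration, not the authors' pipeline: the function names `fbeta` and `best_threshold`, the toy labels, and the toy scores are all hypothetical, and a real experiment would choose the threshold on training or validation folds only, never on the test fold.

```python
def fbeta(y_true, y_pred, beta=1.0):
    """F-beta score for the positive (minority) class, labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def best_threshold(y_true, scores, beta=1.0):
    """Scan every distinct predicted score as a candidate cutoff and
    return the (threshold, F-beta) pair with the highest F-beta."""
    best_t, best_f = 0.5, -1.0
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        f = fbeta(y_true, preds, beta)
        if f > best_f:  # strict '>' keeps the lowest threshold on ties
            best_t, best_f = t, f
    return best_t, best_f


# Hypothetical validation data: 8 negatives, 2 positives (20% minority).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.30, 0.45, 0.35, 0.60]

default_preds = [1 if s >= 0.5 else 0 for s in scores]
default_f1 = fbeta(y_true, default_preds)        # F1 at the 0.5 cutoff
tuned_t, tuned_f1 = best_threshold(y_true, scores)  # F1 at the tuned cutoff
```

On this toy data the default cutoff misses one of the two positives, while the tuned cutoff recovers both at the cost of one false positive. Because only the cutoff changes and the predicted probabilities are left untouched, this approach does not alter the model's probability calibration.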

Paper Structure (21 sections, 4 tables)