Identifying Key Challenges of Hardness-Based Resampling

Pawel Pukowski, Venet Osmani

TL;DR

This work interrogates hardness-based resampling as a means to reduce class-wise performance disparities by aligning class training data with estimated hardness via sample complexity. It assesses model-based hardness estimators (AUM, EL2N, Forgetting) on CIFAR-10/100, implementing undersampling and four oversampling schemes with a tunable imbalance factor, and evaluates robustness across ensemble sizes. Across extensive experiments, hardness-based resampling yields negligible, non-systematic gains in class-level performance or gap reduction on balanced data, challenging the practical applicability of the approach. The authors identify core obstacles—no ground-truth hardness, instability of hardness rankings across estimators and datasets, and oversampling's limited expressive power—and demonstrate, in a pruning case study, that carefully structured imbalance can sometimes improve overall accuracy and fairness, pointing to promising directions such as advanced data generation and alternative regularization strategies.

Abstract

Performance gaps across classes remain a persistent challenge in machine learning and are often attributed to variations in class hardness. One way to quantify class hardness is through sample complexity: the minimum number of samples required to effectively learn a given class. Sample complexity theory suggests that class hardness is driven by differences in the amount of data required for generalization. That is, harder classes need substantially more samples to generalize. Hardness-based resampling is therefore a promising approach to mitigating these performance disparities. While resampling has been studied extensively in data-imbalanced settings, its impact on balanced datasets remains unexplored. This raises the fundamental question of whether resampling is effective because it addresses data imbalance or hardness imbalance. We begin addressing this question by introducing class imbalance into balanced datasets and evaluating its effect on performance disparities. We oversample hard classes and undersample easy classes to bring hard classes closer to their sample complexity requirements, while keeping the dataset size constant for a fair comparison. We estimate class-level hardness using the Area Under the Margin (AUM) hardness estimator and use it to compute resampling ratios. Using these ratios, we perform hardness-based resampling on the well-known CIFAR-10 and CIFAR-100 datasets. Contrary to theoretical expectations, our results show that hardness-based resampling does not meaningfully affect class-wise performance disparities. To explain this discrepancy, we conduct detailed analyses to identify key challenges unique to hardness-based imbalance, distinguishing it from traditional data-based imbalance. Our insights help explain why theoretical sample complexity expectations fail to translate into practical performance gains, and we provide guidelines for future research.
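The resampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the proportional allocation rule, the function names `hardness_based_counts` and `resample_indices`, and the exponent `alpha` used here as an imbalance knob are all assumptions made for illustration. The sketch also assumes hardness scores are positive and that larger means harder, so a score such as AUM, where lower values indicate harder samples, would need to be inverted first.

```python
import numpy as np


def hardness_based_counts(class_hardness, samples_per_class, alpha=1.0):
    """Per-class target counts that keep the total dataset size constant.

    Assumption: counts are allocated proportionally to hardness ** alpha.
    alpha = 0 reproduces the balanced dataset; larger alpha adds imbalance.
    """
    hardness = np.asarray(class_hardness, dtype=float)
    if np.any(hardness <= 0):
        raise ValueError("hardness scores must be positive")
    total = hardness.size * samples_per_class
    weights = hardness ** alpha
    raw = total * weights / weights.sum()
    counts = np.floor(raw).astype(int)
    # Hand the samples lost to rounding to the largest fractional parts.
    remainder = total - counts.sum()
    order = np.argsort(-(raw - counts))
    counts[order[:remainder]] += 1
    return counts


def resample_indices(labels, target_counts, seed=0):
    """Undersample easy classes and oversample hard ones by duplication."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    chosen = []
    for cls, target in enumerate(target_counts):
        idx = np.flatnonzero(labels == cls)
        if target <= idx.size:
            # Undersampling: random subset without replacement.
            picked = rng.choice(idx, size=target, replace=False)
        else:
            # Oversampling: keep every original sample, add random duplicates.
            extra = rng.choice(idx, size=target - idx.size, replace=True)
            picked = np.concatenate([idx, extra])
        chosen.append(picked)
    return np.concatenate(chosen)


# Toy example: three balanced classes, the third twice as hard as the others.
labels = np.repeat(np.arange(3), 100)
counts = hardness_based_counts([1.0, 1.0, 2.0], samples_per_class=100)
indices = resample_indices(labels, counts)
```

Oversampling by plain duplication is the simplest possible scheme. The TL;DR notes that the paper implements four oversampling schemes, which this sketch does not attempt to reproduce.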

Paper Structure

This paper contains 39 sections, 21 equations, 18 figures, 2 tables.

Figures (18)

  • Figure 1: Training an ensemble of ten ResNet18 networks on CIFAR-10 (left) and CIFAR-100 (right) reveals large recall gaps across classes, despite the balanced nature of these datasets. Paired with significantly larger recall gaps across classes for CIFAR-100 than CIFAR-10, this shows class- and dataset-level hardness discrepancies, which we call hardness-based imbalance. We believe that this imbalance can be addressed by hardness-based resampling—oversampling hard classes, and undersampling easy ones.
  • Figure 2: In this work, we begin with data-balanced datasets. Our pipeline starts by estimating class hardness. This estimate is used to compute the resampling ratio, which determines the degree of undersampling for easy classes (light green) and oversampling for hard ones (dark green). The aim of introducing this data imbalance is to decrease the performance gap across classes by counteracting the inherent hardness-based imbalance.
  • Figure 3: We use the above sampling probability (Eq. 7) instead of a linear one to avoid overly aggressive oversampling of samples at the extremes of the hardness spectrum.
  • Figure 4: Sorted class-wise data distribution after resampling, using various values of $\alpha$ to control the imbalance. Hardness-based resampling adds more samples to an average hard class (red region) than it removes from an average easy class (green region).
  • Figure 5: We adjust the noise removal threshold proposed by Pleiss et al. (2020) for two reasons: (a) their threshold removes over a third of samples from some classes, creating class imbalance that complicates hardness estimation; and (b) the cumulative hardness distribution suggests an elbow point as the noise removal threshold.
  • ...and 13 more figures