Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance

Pawel Pukowski, Haiping Lu

TL;DR

The paper challenges the reliance on test accuracy as the sole AutoML metric by revealing in-class data imbalance caused by the distribution of hard versus easy samples. It generalizes the inversion-point framework from binary to multiclass via the radii of class manifolds $R^2_i(t)$ and an ensemble of $100$ fully connected networks to identify per-class hard samples, enabling dataset-wide straggler sets. A benchmarking procedure comparing straggler-, confidence-, and energy-based hard-sample identifiers is proposed, showing that hard-sample distribution can significantly alter perceived generalization and that training on hard samples can improve hard-test performance more than easy-test performance. The work advocates broader, sample-complexity-aware evaluation criteria in AutoML, acknowledges limitations of the manifold-based approach, and invites future research into nuanced data practices for more robust model evaluation.

Abstract

In the AutoML domain, test accuracy is heralded as the quintessential metric for evaluating model efficacy, underpinning a wide array of applications from neural architecture search to hyperparameter optimization. However, the reliability of test accuracy as the primary performance metric has been called into question, notably through research highlighting how label noise can obscure the true ranking of state-of-the-art models. We go further and take another perspective, in which the existence of hard samples within datasets casts further doubt on the generalization capabilities inferred from test accuracy alone. Our investigation reveals that the distribution of hard samples between training and test sets affects the difficulty levels of those sets, thereby influencing the perceived generalization capability of models. We unveil two distinct generalization pathways, toward easy samples and toward hard samples, highlighting the complexity of achieving balanced model evaluation. Finally, we propose a benchmarking procedure for comparing hard sample identification methods, facilitating the advancement of more nuanced approaches in this area. Our primary goal is not to propose a definitive solution but to highlight the limitations of relying primarily on test accuracy as an evaluation metric, even when working with balanced datasets, by introducing the in-class data imbalance problem. By doing so, we aim to stimulate a critical discussion within the research community and open new avenues for research that consider a broader spectrum of model evaluation criteria. The anonymous code is available at https://github.com/PawPuk/CurvBIM under the GPL-3.0 license.
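To make the confidence-based and energy-based hard-sample identifiers mentioned above more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that confidence is the maximum softmax probability and that energy is the commonly used free-energy score, -T · logsumexp(logits / T); the exact definitions and thresholds used in the paper are not given here. The function names (`confidence_score`, `energy_score`, `hardest_indices`) and the `fraction` parameter are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_score(logits):
    """Maximum softmax probability (assumed definition); low values suggest a hard sample."""
    return max(softmax(logits))


def energy_score(logits, temperature=1.0):
    """Free-energy score -T * logsumexp(logits / T) (assumed definition).

    Higher (less negative) energy suggests a hard sample.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in scaled))
    return -temperature * log_sum_exp


def hardest_indices(scores, fraction, higher_is_harder):
    """Return the sorted indices of the `fraction` hardest samples by score."""
    k = max(1, int(round(fraction * len(scores))))
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=higher_is_harder)
    return sorted(order[:k])


# Illustrative logits for three samples (made-up values, not from the paper).
logits_batch = [[5.0, 0.0, 0.0], [0.1, 0.0, 0.05], [3.0, 2.9, 0.0]]
conf = [confidence_score(z) for z in logits_batch]
energy = [energy_score(z) for z in logits_batch]

hard_by_conf = hardest_indices(conf, fraction=0.34, higher_is_harder=False)
hard_by_energy = hardest_indices(energy, fraction=0.34, higher_is_harder=True)
```

In this toy batch, both scores flag the second sample, whose logits are nearly flat. Since the paper identifies hard samples per class, the same ranking could be applied within each class separately rather than over the whole dataset.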

Paper Structure (8 sections, 5 figures)

This paper contains 8 sections, 5 figures.

Figures (5)

  • Figure 1: Let's consider a hypothetical binary classification scenario (a) featuring two distinct class manifolds. The real-world equivalent of this scenario would involve a point cloud non-uniformly sampled from the class manifolds, with some added label noise (stars in b). In this work, we propose that the difficulty of training is derived from the geometrical and topological properties of the class manifolds, leading to areas with higher/lower sample complexity (red/green respectively), due to factors like curvature and homology. Consequently, we observe the emergence of the in-class data imbalance problem (d), which stems from the fact that, although it is more challenging to learn from hard samples because of their sample complexity, the datasets are predominantly composed of easy, not hard, samples.
  • Figure 2: Generalizing the method introduced by Ciceri et al. (2024) to multiclass MNIST reveals distinct inversion points for each class. This observation indicates that the dynamics of manifold segmentation are class-specific, a nuance not captured before because earlier work focused on the binary classification setting. The results on other datasets are available in the appendix.
  • Figure 3: Increasing the proportion of hard samples in the training set improves accuracy on the entire test set (Row 1), but the difference comes mostly from improved accuracy on other hard samples (Row 2), rather than easy samples (Row 3).
  • Figure 4: Increasing the proportion of easy samples in the training set improves accuracy on easy samples in the test set (Row 3), resulting in increased overall accuracy (Row 1), while decreasing the accuracy on hard samples in the test set (Row 2). This resembles the results we would get when adding samples from the majority class in the between-class data imbalance problem.
  • Figure 5: The in-class data imbalance becomes more apparent when we combine the results of the performance on easy samples with those on hard samples in a single figure. It becomes clear that models achieve significantly better accuracy on easy samples than on hard samples, a discrepancy also observed in the between-class data imbalance between majority and minority classes. This in-class data imbalance can become either less or more pronounced depending on how successfully we manage to divide the dataset into hard and easy samples. The overlap between hard and easy sample sets, due to less precise identification, leads to a lower in-class data imbalance, as observed when comparing the accuracies of straggler-based methods with those of confidence-based or energy-based methods. At the top, the green (confidence-based) and blue (energy-based) lines overlap each other at this scale.
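
The three rows reported in Figures 3 and 4 (overall, hard-only, and easy-only test accuracy) can be computed from one set of predictions once the hard test samples are known. The following is a minimal sketch under that assumption, not the authors' evaluation code; the helper names `accuracy` and `split_accuracy` and the toy data are hypothetical.

```python
def accuracy(predictions, labels, indices):
    """Fraction of correct predictions restricted to `indices` (NaN if empty)."""
    if not indices:
        return float("nan")
    correct = sum(1 for i in indices if predictions[i] == labels[i])
    return correct / len(indices)


def split_accuracy(predictions, labels, hard_indices):
    """Report accuracy on the whole test set, its hard subset, and its easy subset."""
    hard = set(hard_indices)
    all_idx = list(range(len(labels)))
    hard_idx = [i for i in all_idx if i in hard]
    easy_idx = [i for i in all_idx if i not in hard]
    return {
        "overall": accuracy(predictions, labels, all_idx),
        "hard": accuracy(predictions, labels, hard_idx),
        "easy": accuracy(predictions, labels, easy_idx),
    }


# Toy example (made-up values): samples 2 and 5 are assumed to be the hard ones.
predictions = [0, 1, 1, 0, 2, 2]
labels = [0, 1, 0, 0, 2, 1]
report = split_accuracy(predictions, labels, hard_indices=[2, 5])
```

In this toy case the overall accuracy looks moderate, yet it hides a perfect score on easy samples and a zero score on hard ones, which is the kind of gap that a single test-accuracy figure cannot show.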