The Impact of Coreset Selection on Spurious Correlations and Group Robustness

Amaya Dharmasiri, William Yang, Polina Kirichenko, Lydia Liu, Olga Russakovsky

TL;DR

The paper addresses how coreset selection interacts with spurious correlations to affect group robustness when group labels are unavailable. It conducts a large-scale empirical study across ten spurious-bias datasets, comparing learning-based and embedding-based sample characterization scores under multiple selection policies, and assesses bias via a bias level metric and worst-group accuracy. Key findings are that embedding-based scores generally pose a lower bias risk than learning-based ones, that lower coreset bias does not reliably translate to better robustness, and that very small coresets require careful handling with strategies beyond simple group balancing. The work provides practical guidance for data reduction in the presence of hidden biases, emphasizing the importance of sample difficulty and sufficient coreset size to maintain group robustness in real-world settings.

Abstract

Coreset selection methods have shown promise in reducing the training data size while maintaining model performance for data-efficient machine learning. However, as many datasets suffer from biases that cause models to learn spurious correlations instead of causal features, it is important to understand whether and how dataset reduction methods may perpetuate, amplify, or mitigate these biases. In this work, we conduct the first comprehensive analysis of the implications of data selection on the spurious bias levels of the selected coresets and the robustness of downstream models trained on them. We use an extensive experimental setting spanning ten different spurious correlation benchmarks, five score metrics to characterize sample importance/difficulty, and five data selection policies across a broad range of coreset sizes. Thereby, we unravel a series of nontrivial nuances in interactions between sample difficulty and bias alignment, as well as dataset bias and resultant model robustness. For example, we find that selecting coresets using embedding-based sample characterization scores runs a comparatively lower risk of inadvertently exacerbating bias than selecting using characterizations based on learning dynamics. Most importantly, our analysis reveals that although some coreset selection methods could achieve lower bias levels by prioritizing difficult samples, they do not reliably guarantee downstream robustness.
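Worst-group accuracy, one of the two quantities the study evaluates, can be illustrated with a minimal sketch. It assumes the common convention in which a group is a (class label, spurious attribute) pair; the function names and the toy data are hypothetical and are not taken from the paper. The paper's bias level metric is not defined in this excerpt, so it is not reproduced here.

```python
import numpy as np

def group_accuracies(y_true, y_pred, spurious):
    """Accuracy per (class label, spurious attribute) group.

    Assumption: groups are label/attribute pairs, as is common in
    spurious-correlation benchmarks.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    spurious = np.asarray(spurious)
    correct = y_true == y_pred
    accs = {}
    for label, attr in sorted(set(zip(y_true.tolist(), spurious.tolist()))):
        mask = (y_true == label) & (spurious == attr)
        accs[(label, attr)] = float(correct[mask].mean())
    return accs

def worst_group_accuracy(y_true, y_pred, spurious):
    """Minimum accuracy over all groups present in the data."""
    return min(group_accuracies(y_true, y_pred, spurious).values())

def average_accuracy(y_true, y_pred):
    """Plain accuracy over all samples."""
    return float((np.asarray(y_true) == np.asarray(y_pred)).mean())

# Toy data: a classifier that predicts the spurious attribute instead of the label.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
spurious = [0, 0, 0, 1, 1, 1, 1, 0]  # mostly aligned with the label
y_pred = list(spurious)

avg = average_accuracy(y_true, y_pred)
wga = worst_group_accuracy(y_true, y_pred, spurious)
print(avg, wga)
```

In this toy case the average accuracy looks acceptable while the minority (bias-conflicting) groups are always misclassified, which is the kind of gap between average and worst-group accuracy that the study tracks.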

Paper Structure

This paper contains 19 sections, 1 equation, 8 figures, 6 tables, 1 algorithm.

Figures (8)

  • Figure 1: Left: Performance of randomly selected coresets from Waterbirds sagawa2019distributionally reveals an increasing discrepancy between worst-group accuracy and average accuracy as coreset size decreases. Right: Distribution of average and worst-group accuracy for current coreset selection algorithms shows large variation in worst-group accuracies even for similar average accuracies.
  • Figure 2: Classifying bias-conflicting samples using characterization scores. We measure the Average Precision (AP) of three learning-based methods (EL2N paul2021deep, Uncertainty colemanselection, and Forgetting toneva2018empirical) and two embedding-based methods (SelfSup sorscher2022beyond and SupProto xia2022moderate) at classifying bias-conflicting vs. bias-aligning samples across 5 datasets. On the more challenging real-world datasets (Urbancars-C li2023whacamoledilemmashortcutscome, Metashift liangmetashift, and Civilcomments borkan2019nuancedmetricsmeasuringunintended), embedding-based methods do not appear to order the samples according to their bias levels (i.e., they have near-random AP); even finetuning these embeddings (depicted by the shaded bars) does not change these findings. (Please refer to Appendix \ref{app:c1} for more results on other datasets.)
  • Figure 3: Data bias and classifier accuracies for Difficult (highest-scoring) and Easy (lowest-scoring) samples using EL2N scores. Selecting the Difficult samples typically results in less biased coresets and correspondingly more robust classifiers (higher worst-group accuracy) than selecting the Easy samples. Classifiers trained on the Difficult samples also tend to be more robust than those trained with Random selection. However, for small coreset sizes, we see a drop in average and worst-group accuracies for Difficult samples, which we examine further in Section \ref{sec:low data regime}. (Please refer to Appendix \ref{app:c2} for more results on other datasets.)
  • Figure 4: Worst-group accuracies for Waterbirds sagawa2019distributionally and Urbancars-C li2023whacamoledilemmashortcutscome using group-balanced coresets. Although all methods have equal bias levels, selecting the most difficult samples from each group results in comparable or worse group robustness.
  • Figure 5: Effect of excluding the most difficult bias-conflicting samples in the small data regime. Median and Stratified selection policies perform better than Difficult selection on worst-group accuracy for small coreset sizes, although the slightly modified Difficult* selection strategy discussed in Section \ref{sec:low data regime} mitigates the difference. (Please refer to Appendix \ref{app:c3} for more results on other datasets.)
  • ...and 3 more figures
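
The captions for Figures 2 and 3 refer to two operations that a short sketch may make concrete: selecting a coreset from per-sample scores (Difficult = highest-scoring, Easy = lowest-scoring, Random), and measuring how well a score ranks bias-conflicting samples above bias-aligning ones via Average Precision. The code below is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the `fraction` and `seed` parameters, and the toy scores are hypothetical, and the Median, Stratified, and Difficult* policies are omitted because this excerpt does not specify them.

```python
import numpy as np

def select_coreset(scores, fraction, policy="difficult", seed=0):
    """Return sorted indices of a coreset chosen from per-sample scores.

    Assumption: a higher score means a more difficult sample.
    """
    scores = np.asarray(scores, dtype=float)
    k = max(1, int(round(fraction * len(scores))))
    if policy == "random":
        rng = np.random.default_rng(seed)
        return np.sort(rng.choice(len(scores), size=k, replace=False))
    order = np.argsort(scores, kind="stable")  # ascending by score
    if policy == "easy":
        chosen = order[:k]       # lowest-scoring samples
    elif policy == "difficult":
        chosen = order[-k:]      # highest-scoring samples
    else:
        raise ValueError(f"unknown policy: {policy}")
    return np.sort(chosen)

def average_precision(scores, is_conflicting):
    """AP of ranking bias-conflicting samples (positives) by descending score."""
    scores = np.asarray(scores, dtype=float)
    positives = np.asarray(is_conflicting, dtype=bool)
    if not positives.any():
        raise ValueError("need at least one bias-conflicting sample")
    order = np.argsort(-scores, kind="stable")
    hits = positives[order]
    precision_at_rank = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    # Average the precision values at the ranks where a positive occurs.
    return float(precision_at_rank[hits].mean())

# Toy scores for four samples (hypothetical values).
scores = [0.1, 0.9, 0.5, 0.3]
difficult = select_coreset(scores, 0.5, "difficult")
easy = select_coreset(scores, 0.5, "easy")
random_pick = select_coreset(scores, 0.5, "random")

ap_informative = average_precision(scores, [0, 1, 1, 0])
ap_misleading = average_precision(scores, [1, 0, 0, 1])
print(difficult, easy, ap_informative, ap_misleading)
```

When the score ranks every bias-conflicting sample first, AP reaches 1; a score that carries no information about bias alignment yields an AP close to the fraction of bias-conflicting samples, which is the "near-random AP" the Figure 2 caption describes.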