The Impact of Coreset Selection on Spurious Correlations and Group Robustness
Amaya Dharmasiri, William Yang, Polina Kirichenko, Lydia Liu, Olga Russakovsky
TL;DR
The paper examines how coreset selection interacts with spurious correlations, and how that interaction affects group robustness when group labels are unavailable. It conducts a large-scale empirical study across ten spurious-correlation benchmarks, comparing learning-based and embedding-based sample characterization scores under multiple selection policies, and measures bias with a bias level metric and worst-group accuracy. The key findings are that embedding-based scores generally carry a lower risk of exacerbating bias than learning-based ones, that lower coreset bias does not reliably translate into better robustness, and that very small coresets require careful handling with strategies beyond simple group balancing. The work offers practical guidance for data reduction in the presence of hidden biases, emphasizing the roles of sample difficulty and sufficient coreset size in maintaining group robustness.
Abstract
Coreset selection methods have shown promise in reducing training data size while maintaining model performance for data-efficient machine learning. However, as many datasets suffer from biases that cause models to learn spurious correlations instead of causal features, it is important to understand whether and how dataset reduction methods may perpetuate, amplify, or mitigate these biases. In this work, we conduct the first comprehensive analysis of the implications of data selection for the spurious bias levels of the selected coresets and the robustness of downstream models trained on them. We use an extensive experimental setting spanning ten different spurious correlation benchmarks, five score metrics to characterize sample importance/difficulty, and five data selection policies across a broad range of coreset sizes. This setting reveals a series of nontrivial interactions between sample difficulty and bias alignment, as well as between dataset bias and the resulting model robustness. For example, we find that selecting coresets using embedding-based sample characterization scores runs a comparatively lower risk of inadvertently exacerbating bias than selecting using characterizations based on learning dynamics. Most importantly, our analysis reveals that although some coreset selection methods can achieve lower bias levels by prioritizing difficult samples, they do not reliably guarantee downstream robustness.
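As a rough illustration of two ideas named above, the sketch below shows one possible score-based selection policy (keep the k highest-scoring samples) and the standard definition of worst-group accuracy (the minimum per-group accuracy). It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `select_top_k` and `worst_group_accuracy`, the toy scores, and the group names are all hypothetical, and the paper's five scores, five policies, and bias level metric are not reproduced here.

```python
from collections import defaultdict


def select_top_k(scores, k):
    """Return the indices of the k highest-scoring samples, in ascending index order.

    Assumption: a higher score means a more difficult or important sample.
    This is only one hypothetical policy; the paper compares several.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def worst_group_accuracy(y_true, y_pred, groups):
    """Return (minimum per-group accuracy, dict of per-group accuracies).

    Group labels are used here for evaluation only.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    per_group = {g: correct[g] / total[g] for g in total}
    return min(per_group.values()), per_group


# Toy data: each group name pairs a class label (y) with a spurious attribute (s).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]
groups = ["y0_s0", "y0_s0", "y0_s1", "y0_s1",
          "y1_s1", "y1_s1", "y1_s0", "y1_s0"]

wga, per_group = worst_group_accuracy(y_true, y_pred, groups)
coreset_idx = select_top_k([0.1, 0.9, 0.4, 0.7], k=2)
print(wga, coreset_idx)
```

In this toy example the overall accuracy is 0.75 while the worst-group accuracy is 0.5, which illustrates why average accuracy alone can hide poor performance on groups that conflict with the spurious correlation.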
