The Power of Few: Accelerating and Enhancing Data Reweighting with Coreset Selection
Mohammad Jafari, Yimeng Zhang, Yihua Zhang, Sijia Liu
TL;DR
This paper tackles the computational bottleneck of training large-scale models by unifying coreset selection with data reweighting. It introduces CW-ERM, a three-stage approach that (1) selects a representative coreset using median distance in a pretrained feature space, (2) reweights the coreset with MetaWeightNet, and (3) broadcasts the learned weights to the full dataset via nearest-neighbor mapping. Empirical results on CIFAR-10 and CIFAR-100 show that a coreset as small as a $0.01$ fraction of the data can achieve higher accuracy than baselines while reducing reweighting time, which points to gains in both efficiency and robustness. The work presents a scalable framework that could extend coreset-based reweighting to broader supervised learning tasks.
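The three stages can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes that "median distance" means keeping, per class, the samples whose distance to the class center is closest to the median distance; it replaces MetaWeightNet with random placeholder weights; and it uses brute-force nearest-neighbor search. The function names `select_coreset` and `broadcast_weights` are hypothetical.

```python
import numpy as np

def select_coreset(features, labels, fraction):
    """Stage 1 (assumed reading): per class, keep the samples whose distance
    to the class center is closest to the median distance."""
    selected = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        center = features[idx].mean(axis=0)
        dist = np.linalg.norm(features[idx] - center, axis=1)
        k = max(1, int(round(fraction * len(idx))))
        # Rank samples by how far their distance is from the median distance.
        order = np.argsort(np.abs(dist - np.median(dist)))
        selected.append(idx[order[:k]])
    return np.sort(np.concatenate(selected))

def broadcast_weights(features, coreset_idx, coreset_weights):
    """Stage 3: give every sample the weight of its nearest coreset sample.
    Brute force for clarity; a real pipeline would likely use an NN index."""
    core = features[coreset_idx]
    sq_dist = ((features[:, None, :] - core[None, :, :]) ** 2).sum(axis=2)
    nearest = sq_dist.argmin(axis=1)
    return coreset_weights[nearest]

# Toy data standing in for pretrained features (hypothetical sizes).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 8))
labels = rng.integers(0, 2, size=200)

coreset_idx = select_coreset(features, labels, fraction=0.1)

# Stage 2 placeholder: in the paper these weights are learned with
# MetaWeightNet; random values are used here only to show the data flow.
coreset_weights = rng.uniform(0.5, 1.5, size=len(coreset_idx))

full_weights = broadcast_weights(features, coreset_idx, coreset_weights)
```

In this sketch each coreset sample keeps its own weight, since its nearest coreset neighbor is itself, and every other sample inherits the weight of its closest coreset point.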
Abstract
As machine learning tasks continue to evolve, the trend has been to gather larger datasets and train increasingly large models. While this has improved accuracy, it has also escalated computational costs to unsustainable levels. Our work aims to balance computational efficiency and model accuracy, a persistent challenge in the field. We introduce a novel method that employs core subset (coreset) selection for reweighting, optimizing both computational time and model performance. By focusing on a strategically selected coreset, our approach offers a robust representation of the data, as it minimizes the influence of outliers. The recalibrated weights are then mapped back to the entire dataset and propagated across it. Our experimental results substantiate the effectiveness of this approach, underscoring its potential as a scalable and precise solution for model training.
