
The Power of Few: Accelerating and Enhancing Data Reweighting with Coreset Selection

Mohammad Jafari, Yimeng Zhang, Yihua Zhang, Sijia Liu

TL;DR

This paper tackles the computational bottleneck of training large-scale models by unifying coreset selection with data reweighting. It introduces CW-ERM, a three-stage approach: select a representative coreset by a median-distance criterion in a pretrained feature space, reweight the coreset with MetaWeightNet, and broadcast the learned weights to the full dataset via nearest-neighbor mapping. Empirical results on CIFAR-10 and CIFAR-100 show that a coreset as small as $0.01$ of the data (a 1% coreset ratio) can achieve higher accuracy than the baselines while reducing reweighting time. The work presents a scalable framework with clear avenues for applying coreset-based reweighting to broader supervised learning tasks.
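The first and third stages can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `select_coreset` and `broadcast_weights` are hypothetical, "median-distance" is read here as keeping, within each class, the samples whose distance to the class mean is closest to the median distance, and the nearest-neighbor mapping is assumed to use Euclidean distance and to ignore labels. The MetaWeightNet stage is not reproduced; random placeholder weights stand in for the learned ones.

```python
import numpy as np


def select_coreset(features, labels, ratio):
    """Assumed median-distance rule: per class, keep the samples whose
    distance to the class mean is closest to the median of those distances."""
    selected = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        center = features[idx].mean(axis=0)
        dist = np.linalg.norm(features[idx] - center, axis=1)
        # Small score = distance close to the class median.
        score = np.abs(dist - np.median(dist))
        k = max(1, int(round(ratio * len(idx))))
        selected.append(idx[np.argsort(score)[:k]])
    return np.concatenate(selected)


def broadcast_weights(features, coreset_idx, coreset_weights):
    """Give every sample the weight of its nearest coreset point
    (Euclidean distance in feature space, labels ignored)."""
    core = features[coreset_idx]
    # Squared distances, shape (n_samples, n_coreset).
    d2 = ((features[:, None, :] - core[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    return coreset_weights[nearest]


# Toy usage with synthetic "pretrained" features.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 16))
labels = rng.integers(0, 4, size=200)

coreset_idx = select_coreset(features, labels, ratio=0.1)
# Placeholder for the weights a reweighting method would learn on the coreset.
coreset_weights = rng.uniform(0.5, 1.5, size=len(coreset_idx))
full_weights = broadcast_weights(features, coreset_idx, coreset_weights)
```

The dense distance matrix in `broadcast_weights` is only practical for small toy inputs; at dataset scale one would likely compute it in chunks or use an approximate nearest-neighbor index.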

Abstract

As machine learning tasks continue to evolve, the trend has been to gather larger datasets and train increasingly larger models. While this has led to advancements in accuracy, it has also escalated computational costs to unsustainable levels. Addressing this, our work aims to strike a delicate balance between computational efficiency and model accuracy, a persisting challenge in the field. We introduce a novel method that employs core subset selection for reweighting, effectively optimizing both computational time and model performance. By focusing on a strategically selected coreset, our approach offers a robust representation, as it efficiently minimizes the influence of outliers. The re-calibrated weights are then mapped back to and propagated across the entire dataset. Our experimental results substantiate the effectiveness of this approach, underscoring its potential as a scalable and precise solution for model training.

Paper Structure

This paper contains 9 sections, 3 equations, 3 figures, 1 table, 1 algorithm.

Figures (3)

  • Figure 1: Analysis of model performance and computational efficiency. The first part illustrates the three-stage process of our method: coreset selection, data reweighting, and weight broadcasting, visualized through four 2D t-SNE embeddings indicated by a, b, c, d. The second part compares the accuracy and time consumption of ERM, W-ERM, and our method (CW-ERM). Our approach not only yields the highest accuracy but also maintains a balance between computational efficiency and performance.
  • Figure 2: The effect of the coreset ratio on model performance on CIFAR-10 and CIFAR-100, where the coreset is used solely for data reweighting. After the coreset is reweighted, the weights are broadcast back to the full dataset for training. Larger coreset ratios may lead to test accuracy degradation, particularly on more complex datasets such as CIFAR-100. "Uniform data reweighting" refers to treating every data point in the dataset with equal importance, without any specialized weighting scheme.
  • Figure 3: Time consumption breakdown for 5 different coreset ratios. The stacked bar chart shows the average time spent on coreset selection, reweighting, and training for each coreset ratio.