REDUCR: Robust Data Downsampling Using Class Priority Reweighting

William Bankes, George Hughes, Ilija Bogunovic, Zi Wang

TL;DR

On web-scraped datasets with imbalanced class distributions, REDUCR significantly improves worst-class test accuracy (and average accuracy), surpassing state-of-the-art methods by around 15%.

Abstract

Modern machine learning models are becoming increasingly expensive to train for real-world image and text classification tasks, where massive web-scale data is collected in a streaming fashion. To reduce the training cost, online batch selection techniques have been developed to choose the most informative datapoints. However, these techniques can suffer from poor worst-class generalization performance due to class imbalance and distributional shifts. This work introduces REDUCR, a robust and efficient data downsampling method that uses class priority reweighting. REDUCR reduces the training data while preserving worst-class generalization performance. REDUCR assigns priority weights to datapoints in a class-aware manner using an online learning algorithm. We demonstrate the data efficiency and robust performance of REDUCR on vision and text classification tasks. On web-scraped datasets with imbalanced class distributions, REDUCR significantly improves worst-class test accuracy (and average accuracy), surpassing state-of-the-art methods by around 15%.
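The abstract reports results in terms of worst-class test accuracy alongside average accuracy. As a minimal sketch of how the two differ, assuming worst-class accuracy means the minimum of the per-class accuracies (the function names below are illustrative and not taken from the paper):

```python
from collections import defaultdict


def per_class_accuracy(y_true, y_pred):
    """Return {class label: accuracy on the examples whose true label is that class}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return {c: correct[c] / total[c] for c in total}


def worst_class_accuracy(y_true, y_pred):
    """Minimum per-class accuracy (assumed definition of 'worst-class accuracy')."""
    return min(per_class_accuracy(y_true, y_pred).values())


def average_accuracy(y_true, y_pred):
    """Plain accuracy over all examples, which majority classes dominate."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

On an imbalanced label set such as `y_true = [0, 0, 0, 0, 1, 1]` with predictions `[0, 0, 0, 0, 1, 0]`, the average accuracy is 5/6 while the worst-class accuracy is 0.5, which is the gap that a class-aware selection method aims to close.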

Paper Structure (35 sections, 11 equations, 17 figures, 4 tables, 4 algorithms)

Figures (17)

  • Figure 1: REDUCR starts by initializing weights of classes. At each timestep $t$, the model receives a batch of datapoints $B_t$. REDUCR computes the selection scores for each datapoint based on its usefulness to the model and the class weights, and selects new datapoints $b_t\subset B_t$ that achieve the highest selection scores. After the model takes gradient steps on the selected datapoints, REDUCR adjusts the weights to reflect increased priorities on underperforming classes.
  • Figure 2: REDUCR significantly improves worst-class test accuracy on Clothing1M, outperforming the Uniform baseline and other recent methods.
  • Figure 3: REDUCR improves the worst-class test accuracy and data efficiency when compared with the RHO-Loss, Train Loss and Uniform baselines on (a) the Clothing1M dataset, (b) the CINIC10 dataset, and (c) the CIFAR100 dataset.
  • Figure 4: (a) The worst-class test accuracy decreases when the model loss, class irreducible loss, and class-holdout loss terms are removed from REDUCR on CINIC10. Comparing REDUCR with clipping for excess losses (Algorithm \ref{alg:robust active subsampling}) and REDUCR (no clip), which removes the clipping, we observe that REDUCR achieves more stable performance. We show the class weights $\mathbf{w}$ at each training step for (b) REDUCR and (c) REDUCR with the class-holdout loss term ablated. The ablation model fails to consistently prioritise the underperforming classes across multiple runs.
  • Figure 5: REDUCR improves the average test accuracy on the Clothing1M dataset.
  • ...and 12 more figures
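
The loop described in the Figure 1 caption (score each datapoint in $B_t$, keep the top-scoring subset $b_t$, then raise the priority of underperforming classes) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the selection score here is just the class weight times a per-example loss, standing in for the "usefulness" terms the paper uses, and the weight update is a generic exponentiated (Hedge-style) rule chosen for illustration. The names `select_batch`, `update_class_weights`, and `eta` are hypothetical.

```python
import math


def select_batch(losses, labels, class_weights, k):
    """Return indices of the k datapoints with the highest selection score.

    Assumption: score = class weight * per-example loss, a stand-in for
    the paper's usefulness measure.
    """
    scores = [class_weights[y] * loss for loss, y in zip(losses, labels)]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def update_class_weights(class_weights, class_losses, eta=1.0):
    """Exponentiated update (assumed): classes with higher loss gain weight.

    The result is renormalised so the weights sum to one.
    """
    raw = [w * math.exp(eta * loss) for w, loss in zip(class_weights, class_losses)]
    total = sum(raw)
    return [r / total for r in raw]


# One timestep on a toy batch B_t of four datapoints from two classes.
losses = [0.1, 2.0, 0.3, 1.5]
labels = [0, 1, 0, 1]
weights = [0.5, 0.5]                     # initial class weights
chosen = select_batch(losses, labels, weights, k=2)
# ... the model would take a gradient step on the chosen datapoints here ...
weights = update_class_weights(weights, class_losses=[0.2, 1.0])
```

With equal weights the two highest-loss points are chosen; once a class's weight grows, its datapoints can outrank higher-loss points from other classes, which is the reweighting effect the caption describes.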