REDUCR: Robust Data Downsampling Using Class Priority Reweighting

William Bankes; George Hughes; Ilija Bogunovic; Zi Wang

TL;DR

On web-scraped datasets with imbalanced class distributions, REDUCR significantly improves worst-class test accuracy (and average accuracy), surpassing state-of-the-art methods by around 15%.

Abstract

Modern machine learning models are becoming increasingly expensive to train for real-world image and text classification tasks, where massive web-scale data is collected in a streaming fashion. To reduce the training cost, online batch selection techniques have been developed to choose the most informative datapoints. However, these techniques can suffer from poor worst-class generalization performance due to class imbalance and distributional shifts. This work introduces REDUCR, a robust and efficient data downsampling method that uses class priority reweighting. REDUCR reduces the training data while preserving worst-class generalization performance. REDUCR assigns priority weights to datapoints in a class-aware manner using an online learning algorithm. We demonstrate the data efficiency and robust performance of REDUCR on vision and text classification tasks. On web-scraped datasets with imbalanced class distributions, REDUCR significantly improves worst-class test accuracy (and average accuracy), surpassing state-of-the-art methods by around 15%.
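
The distinction between average and worst-class accuracy can be made concrete with a small sketch. It assumes that worst-class accuracy means the minimum of the per-class accuracies, which is the usual reading; the helper names `per_class_accuracy` and `worst_class_accuracy` and the toy labels are illustrative and do not come from the paper.

```python
from collections import defaultdict


def per_class_accuracy(labels, predictions):
    """Return {class: accuracy}, where each accuracy is computed only
    over the examples whose true label is that class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, y_hat in zip(labels, predictions):
        total[y] += 1
        correct[y] += int(y == y_hat)
    return {c: correct[c] / total[c] for c in total}


def worst_class_accuracy(labels, predictions):
    """Assumed definition: the minimum accuracy over all classes."""
    return min(per_class_accuracy(labels, predictions).values())


# Toy imbalanced test set: eight examples of class 0, two of class 1.
labels      = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
predictions = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

average = sum(int(y == p) for y, p in zip(labels, predictions)) / len(labels)
worst = worst_class_accuracy(labels, predictions)
print(f"average accuracy: {average:.2f}, worst-class accuracy: {worst:.2f}")
```

In this toy case the average accuracy looks high while the minority class is classified correctly only half the time, which is the kind of gap a worst-class metric is meant to expose.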

Paper Structure (35 sections, 11 equations, 17 figures, 4 tables, 4 algorithms)

Introduction
Main contributions.
Related work.
Background
Problem Formulation
REDUCR for Robust Online Batch Selection
Online Learning
Computing selection scores
Class-Irreducible Loss Models
REDUCR as a practical algorithm
Experiments
Key results
Ablation Studies
Scaling up the number of classes
Imbalanced Datasets
...and 20 more sections

Figures (17)

Figure 1: REDUCR starts by initializing weights of classes. At each timestep $t$, the model receives a batch of datapoints $B_t$. REDUCR computes the selection scores for each datapoint based on its usefulness to the model and the class weights, and selects new datapoints $b_t\subset B_t$ that achieve the highest selection scores. After the model takes gradient steps on the selected datapoints, REDUCR adjusts the weights to reflect increased priorities on underperforming classes.
Figure 2: REDUCR significantly improves worst-class test accuracy on Clothing1M, outperforming Uniform and other recent works.
Figure 3: REDUCR improves the worst-class test accuracy and data efficiency when compared with the RHO-Loss, Train Loss and Uniform baselines on the Clothing1M, CINIC10, and CIFAR100 datasets.
Figure 4: The worst-class test accuracy decreases when the model loss, class irreducible loss, and class-holdout loss terms are removed from REDUCR on CINIC10. Comparing REDUCR, which clips excess losses, with REDUCR (no clip), which removes the clipping, we observe that REDUCR achieves more stable performance. We also show the class weights $\mathbf{w}$ at each training step for REDUCR and for REDUCR with the class-holdout loss term ablated. The ablation model fails to consistently prioritise the underperforming classes across multiple runs.
Figure 5: REDUCR improves the average test accuracy on the Clothing1M dataset.
...and 12 more figures
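
The loop described in the Figure 1 caption (score a batch, keep the highest-scoring datapoints, then shift class weights toward underperforming classes) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: each score here is a per-datapoint usefulness value scaled by the weight of the datapoint's own class, and the weight update is a generic multiplicative-weights step with a hypothetical step size `eta`. The paper's actual selection scores, which involve the model loss, class irreducible loss, and class-holdout loss terms named in Figure 4, are not reproduced. The function names and toy numbers are illustrative.

```python
import math


def select_batch(usefulness, labels, class_weights, k):
    """Return indices of the k datapoints with the highest scores.

    Assumption: score = usefulness of the datapoint times the weight
    of its own class. This stands in for the paper's selection score.
    """
    scores = [class_weights[y] * u for u, y in zip(usefulness, labels)]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def update_class_weights(class_weights, class_losses, eta=1.0):
    """Generic multiplicative-weights step (an assumed update rule):
    classes with a higher loss receive a larger share of the weight."""
    raw = [w * math.exp(eta * l) for w, l in zip(class_weights, class_losses)]
    z = sum(raw)
    return [r / z for r in raw]


# Toy timestep with two classes and a batch B_t of four datapoints.
class_weights = [0.5, 0.5]            # initial class weights
usefulness = [0.2, 0.9, 0.4, 0.8]     # stand-in per-datapoint usefulness
labels = [0, 0, 1, 1]

selected = select_batch(usefulness, labels, class_weights, k=2)  # b_t

# A real training loop would take gradient steps on the selected
# datapoints here. The per-class losses below are made-up values in
# which class 1 is the underperforming class.
class_losses = [0.1, 1.0]
new_weights = update_class_weights(class_weights, class_losses)
print(selected, new_weights)
```

After the update, the weight on class 1 is larger than the weight on class 0, so datapoints from the underperforming class are more likely to be selected at the next timestep.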
