Large-Scale Dataset Pruning in Adversarial Training through Data Importance Extrapolation

Björn Nieth; Thomas Altstidl; Leo Schwinn; Björn Eskofier

TL;DR

The paper tackles the high computational cost of adversarial training on large synthetic datasets. It introduces data importance extrapolation: a pruning score for an unseen sample is estimated by averaging the scores of its $k$ nearest neighbors in a learned embedding space, so that expensive ranking methods only have to be computed on a small subset. Applying this to dynamic uncertainty (DU) and a new frequency-based (FP) pruning metric, the authors report improvements in robust and clean accuracy under 50% pruning on large synthetic CIFAR-10 data, with class-balanced pruning often performing best. This data-centric approach makes adversarial training with synthetic data more practical, although it depends on the quality of the extrapolation; the authors suggest richer data attribution methods as future work.
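The extrapolation step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `extrapolate_scores` and `class_balanced_keep` are hypothetical, the distance is assumed to be Euclidean, and the pruning rule is assumed to keep the highest-scoring samples within each class.

```python
import numpy as np

def extrapolate_scores(ref_emb, ref_scores, query_emb, k=3):
    """Estimate a score for each query sample as the mean score of its
    k nearest reference samples (assumption: Euclidean distance)."""
    ref_emb = np.asarray(ref_emb, dtype=float)
    ref_scores = np.asarray(ref_scores, dtype=float)
    query_emb = np.asarray(query_emb, dtype=float)
    # Squared distances between every query and every reference sample,
    # shape (n_query, n_ref).
    diff = query_emb[:, None, :] - ref_emb[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    # Indices of the k closest reference samples per query (unordered).
    nearest = np.argpartition(d2, k - 1, axis=1)[:, :k]
    return ref_scores[nearest].mean(axis=1)

def class_balanced_keep(scores, labels, keep_fraction=0.5):
    """Return sorted indices of the samples kept when every class retains
    the same fraction of its samples (assumption: highest scores are kept)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    keep = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        n_keep = int(keep_fraction * idx.size)
        # Sort this class's samples by descending score.
        order = idx[np.argsort(-scores[idx], kind="stable")]
        keep.extend(order[:n_keep].tolist())
    return sorted(keep)

# Toy usage: four scored reference samples, two unscored query samples.
ref_emb = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]]
ref_scores = [0.2, 0.4, 0.8, 1.0]
query_emb = [[0.5, 0.0], [10.5, 10.0]]
estimated = extrapolate_scores(ref_emb, ref_scores, query_emb, k=2)
kept = class_balanced_keep([0.9, 0.1, 0.5, 0.3], [0, 0, 1, 1], keep_fraction=0.5)
```

The brute-force distance matrix is only practical for small sets; for a large synthetic dataset an approximate nearest-neighbor index would likely be needed, which is outside the scope of this sketch.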

Abstract

The vulnerability of deep learning models to small, imperceptible attacks limits their adoption in real-world systems. Adversarial training has proven to be one of the most promising strategies against these attacks, at the expense of a substantial increase in training time. With the ongoing trend of integrating large-scale synthetic data, this cost is expected to increase even further. This creates a need for data-centric approaches that reduce the number of training samples while maintaining accuracy and robustness. While data pruning and active learning are prominent research topics in deep learning, they are as of now largely unexplored in the adversarial training literature. We address this gap and propose a new data pruning strategy based on extrapolating data importance scores from a small set of data to a larger set. In an empirical evaluation, we demonstrate that extrapolation-based pruning can efficiently reduce dataset size while maintaining robustness.

Paper Structure (18 sections, 2 equations, 5 figures, 4 tables)

Introduction
Background
Extrapolating data importance scores
Experiment Setup
Results
Dynamic uncertainty analysis
Dynamic uncertainty in adversarial training
Extrapolation experiments
Limitations
Extrapolation model
Pruning metric
Performance Evaluation
Related Work
Adversarial Training
Data Selection in Adversarial Training
...and 3 more sections

Figures (5)

Figure 1: Mean Absolute Error (MAE) for different data importance extrapolation settings. $k$ denotes the number of nearest neighbors used for extrapolation.
Figure 2: Distribution of adversarial DU scores vs. standard DU scores.
Figure 3: The upper two plots show predicted certainties for individual samples during standard training (left) and adversarial training (right). The images below visualize the relationship between the DU of a sample and the mean magnitude of the DFT calculated on the training dynamics of the sample. We distinguish the mean magnitude of the low-frequency spectrum (frequencies 1-10) and the high-frequency spectrum (11-150). $R$ denotes the Pearson correlation between the magnitude and the DU.
Figure 4: Overlap between the sets pruned by $\Tilde{U}_{adv}$ vs. $U$ for 50% pruning on CIFAR-10. Note that the two scores lead to the same pruning decision for only two thirds of all images.
Figure 5: Extrapolated DU score distribution compared to ground truth DU score distribution. Extrapolated scores are biased toward the mean of the ground truth distribution.
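The frequency analysis mentioned in the caption of Figure 3 (mean DFT magnitude of a sample's training dynamics in a low and a high band, correlated with its DU) might be computed along the following lines. This is a hedged sketch: `band_mean_magnitude` and `pearson_r` are hypothetical helper names, the use of a one-sided real FFT is an assumption, and the toy series below is much shorter than whatever the paper recorded, so the bands are scaled down from the 1-10 and 11-150 ranges in the caption.

```python
import numpy as np

def band_mean_magnitude(dynamics, lo, hi):
    """Mean DFT magnitude over frequency bins lo..hi (inclusive).

    dynamics: array of shape (n_samples, n_steps) holding, for each sample,
    the predicted certainty recorded at each training step (assumed layout).
    """
    dynamics = np.asarray(dynamics, dtype=float)
    # One-sided spectrum; bin 0 is the constant (mean) component.
    spectrum = np.abs(np.fft.rfft(dynamics, axis=1))
    if hi >= spectrum.shape[1]:
        raise ValueError("band exceeds the available frequency bins")
    return spectrum[:, lo:hi + 1].mean(axis=1)

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    return float(np.corrcoef(x, y)[0, 1])

# Toy dynamics: one slowly oscillating and one rapidly oscillating sample.
steps = np.arange(32)
slow = np.cos(2 * np.pi * 2 * steps / 32)   # energy in bin 2
fast = np.cos(2 * np.pi * 12 * steps / 32)  # energy in bin 12
dynamics = np.stack([slow, fast])

low_band = band_mean_magnitude(dynamics, 1, 4)     # scaled-down "low" band
high_band = band_mean_magnitude(dynamics, 10, 14)  # scaled-down "high" band
```

Given a vector of per-sample DU scores, `pearson_r(low_band, du_scores)` would then give the kind of correlation value that the caption denotes by $R$.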
