Deep Learning on a Data Diet: Finding Important Examples Early in Training

Mansheej Paul, Surya Ganguli, Gintare Karolina Dziugaite

TL;DR

This work tackles the data-inefficiency problem in deep learning by introducing two early-training importance scores, GraNd and EL2N, that rank training examples by their influence on learning. By averaging over multiple initializations, these scores enable substantial data pruning (e.g., 50% on CIFAR-10) without sacrificing test accuracy, and they generalize across architectures and hyperparameters. Beyond pruning, the authors use these scores to probe training dynamics, revealing that high-EL2N examples drive rapid NTK evolution and contribute to a rougher loss landscape, while low-EL2N subsets yield smoother dynamics. The findings offer a practical path to faster, data-efficient training and provide new insights into how data distribution shapes learning and generalization.

Abstract

Recent success in deep learning has partially been driven by training increasingly overparametrized networks on ever larger datasets. It is therefore natural to ask: how much of the data is superfluous, which examples are important for generalization, and how do we find them? In this work, we make the striking observation that, in standard vision datasets, simple scores averaged over several weight initializations can be used to identify important examples very early in training. We propose two such scores -- the Gradient Normed (GraNd) and the Error L2-Norm (EL2N) scores -- and demonstrate their efficacy on a range of architectures and datasets by pruning significant fractions of training data without sacrificing test accuracy. In fact, using EL2N scores calculated a few epochs into training, we can prune half of the CIFAR10 training set while slightly improving test accuracy. Furthermore, for a given dataset, EL2N scores from one architecture or hyperparameter configuration generalize to other configurations. Compared to recent work that prunes data by discarding examples that are rarely forgotten over the course of training, our scores use only local information early in training. We also use our scores to detect noisy examples and study training dynamics through the lens of important examples -- we investigate how the data distribution shapes the loss surface and identify subspaces of the model's data representation that are relatively stable over training.
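The EL2N-based pruning described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the EL2N score of an example is the L2 norm of the difference between the network's softmax output and the one-hot label, averaged over runs with different weight initializations; the abstract names the score but does not define it. The function names (`el2n_score`, `mean_el2n`, `keep_highest`) and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def el2n_score(logits, label):
    """L2 norm of the error vector softmax(logits) - one_hot(label).

    Assumed definition of the per-example, single-run score.
    """
    probs = softmax(logits)
    return math.sqrt(
        sum((p - (1.0 if k == label else 0.0)) ** 2 for k, p in enumerate(probs))
    )


def mean_el2n(logits_per_run, labels):
    """Average each example's score over runs (one run per initialization)."""
    n_runs = len(logits_per_run)
    return [
        sum(el2n_score(run[i], y) for run in logits_per_run) / n_runs
        for i, y in enumerate(labels)
    ]


def keep_highest(scores, keep_fraction):
    """Indices of the highest-scoring examples; the lowest scores are pruned."""
    n_keep = int(round(len(scores) * keep_fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:n_keep])


# Toy data: logits for 4 examples from 2 hypothetical runs, recorded
# a few epochs into training (values are made up for illustration).
logits_per_run = [
    [[10.0, 0.0], [0.0, 0.0], [0.0, 10.0], [0.0, 0.5]],
    [[8.0, 0.0], [0.2, 0.0], [0.0, 9.0], [1.0, 0.0]],
]
labels = [0, 0, 0, 1]

scores = mean_el2n(logits_per_run, labels)
kept = keep_highest(scores, keep_fraction=0.5)  # train on this subset only
```

In this sketch, an example the network already classifies confidently and correctly gets a score near zero and is pruned first, while uncertain or misclassified examples score higher and are kept. In practice the logits would come from real networks trained for a few epochs from independent initializations.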

Paper Structure

This paper contains 39 sections, 1 theorem, 12 equations, 18 figures.

Key Result

Lemma 2.2

Let $S_{\neg j} = S \setminus (x_j,y_j)$. Then for all $(x^*,y^*)$, there exists $c$ such that

Figures (18)

  • Figure 1: Columns correspond to three different dataset and network combinations (labeled at the top). Each legend applies to all three panels in its row. First row: Final test accuracy achieved by training on a subset of the training data consisting of the examples with the highest forgetting, EL2N, and GraNd scores, computed at different times early in training. Subsets of a fixed size are used: networks are trained on 50% of the training data for CIFAR-10, 60% for CINIC-10, and 75% for CIFAR-100. Second row: Final test accuracy achieved by training after different fractions of the dataset are pruned. Here we compare forgetting scores at the end of training and EL2N scores early in training (at epoch 20). In each case, examples with the lowest scores are pruned at initialization. In all experiments, accuracies achieved by training on the full dataset and on a random subset of the corresponding size are used as baselines.
  • Figure 2: ResNet18 trained on a 40% subset of CIFAR-10 with clean (left) and 10% randomized labels (right). The training subset contains the lowest scoring examples after examples with scores below the offset are discarded. Scores computed at epoch 10.
  • Figure 3: Kernel velocity for different subsets of images when ResNet18 is trained on CIFAR-10 with all true labels (left) and 10% label noise (right). Examples are sorted in ascending order by EL2N scores and each point corresponds to the kernel velocity of 100 contiguous images starting at example index. Both scores and velocities are computed at the same epoch indicated by color.
  • Figure 4: The final training error barrier between children on subsets of the 1000 highest (green) and lowest (orange) EL2N score examples, and on a randomly selected training subset (blue), as a function of the spawning time. Left to right: different dataset and network combinations.
  • Figure 5: Examples with the smallest (first row) and second smallest (second row) GraNd scores for each class (columns, from left to right: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck) from a ResNet18 trained on CIFAR-10. GraNd scores were calculated at initialization.
  • ...and 13 more figures

Theorems & Definitions (4)

  • Definition 2.1
  • Lemma 2.2
  • proof
  • Definition 2.3