Distilling the Knowledge in Data Pruning

Emanuel Ben-Baruch, Adam Botach, Igor Kviatkovsky, Manoj Aggarwal, Gérard Medioni

TL;DR

This work tackles the challenge of data pruning by adding knowledge distillation (KD) from a teacher trained on the full dataset to a student trained on a pruned subset. The authors formulate a KD-augmented objective with an adaptive weight, show that KD substantially improves accuracy across pruning methods and datasets (including ImageNet), and identify practical insights on teacher capacity and pruning levels. They provide theoretical justification for bias reduction via self-distillation in pruned-data training and demonstrate that random pruning with KD can rival or surpass sophisticated pruning strategies. The findings offer actionable guidance for training under data-limited regimes and come with implementation details and code release plans to facilitate adoption.

Abstract

With the increasing size of datasets used for training neural networks, data pruning becomes an attractive field of research. However, most current data pruning algorithms are limited in their ability to preserve accuracy compared to models trained on the full data, especially in high pruning regimes. In this paper we explore the application of data pruning while incorporating knowledge distillation (KD) when training on a pruned subset. That is, rather than relying solely on ground-truth labels, we also use the soft predictions from a teacher network pre-trained on the complete data. By integrating KD into training, we demonstrate significant improvement across datasets, pruning methods, and all pruning fractions. We first establish a theoretical motivation for employing self-distillation to improve training on pruned data. Then, we empirically make a compelling and highly practical observation: using KD, simple random pruning is comparable or superior to sophisticated pruning methods across all pruning regimes. On ImageNet, for example, we achieve superior accuracy despite training on a random subset of only 50% of the data. Additionally, we demonstrate a crucial connection between the pruning factor and the optimal knowledge distillation weight. This helps mitigate the impact of samples with noisy labels and low-quality images retained by typical pruning algorithms. Finally, we make an intriguing observation: when using lower pruning fractions, larger teachers lead to accuracy degradation, while, surprisingly, employing teachers with a smaller capacity than the student's may improve results. Our code will be made available.
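
The setup described in the abstract, random pruning plus a loss that mixes ground-truth labels with the teacher's soft predictions, can be sketched as follows. This is an illustrative sketch and not the authors' implementation: the exact form of the paper's loss is not reproduced on this page, so the convex combination `(1 - alpha) * CE + alpha * KL` is an assumption, as are the function names `random_prune` and `kd_loss`, the `temperature` parameter, and the `seed` argument. Only the weight $\alpha$ and the retained data fraction $f$ come from the source.

```python
import math
import random


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_norm for z in scaled]


def random_prune(num_samples, f, seed=0):
    """Return sorted indices of a uniformly random subset keeping a fraction f.

    Hypothetical helper: f is the fraction of data that is kept.
    """
    keep = max(1, round(f * num_samples))
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_samples), keep))


def kd_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=1.0):
    """Assumed objective: (1 - alpha) * CE(student, label) + alpha * KL(teacher || student).

    alpha=0.5 corresponds to an equalized weight between the two terms.
    """
    # Cross-entropy with the ground-truth label (temperature 1).
    ce = -log_softmax(student_logits)[label]
    # KL divergence from the teacher's soft predictions to the student's.
    log_t = log_softmax(teacher_logits, temperature)
    log_s = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
    return (1.0 - alpha) * ce + alpha * kl


if __name__ == "__main__":
    kept = random_prune(num_samples=10, f=0.5)
    print("kept indices:", kept)
    print("loss:", kd_loss([2.0, 0.5, -1.0], [1.5, 1.0, -0.5], label=0))
```

In a real training loop the teacher logits would come from a network trained on the full dataset, while the student would only ever see the samples whose indices `random_prune` returns. Setting `alpha=0.0` recovers plain cross-entropy training on the pruned subset.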

Paper Structure

This paper contains 19 sections, 3 theorems, 15 equations, 10 figures, and 1 table.

Key Result

Theorem 1

Let $\mathbf{X}\in\mathbb{R}^{d\times N}$ and $\mathbf{y}\in\mathbb{R}^{N}$ be the full observation matrix and label vector, respectively. Let $\mathbf{y}_f=\mathbf{X}_f^T\pmb{\theta}^*+\pmb{\eta}_f$, where $\pmb\theta^*$ is the ground-truth projection vector and $\pmb\eta_f\in\mathbb{R}^N$ is a Gau

Figures (10)

  • Figure 1: Knowledge distillation for data pruning. (a) We investigate the usage of a teacher model, pre-trained on a full dataset, to guide a student model during training on a pruned subset of the same data. (b) We find that by integrating KD into the training, simple random pruning outperforms other sophisticated pruning algorithms across all pruning regimes. (c) Interestingly, we observe that when using small data fractions, training with large teachers degrades accuracy, while smaller teachers are favored. This suggests that in high pruning regimes (low $f$), the training is more sensitive to the capacity gap between the teacher and the student.
  • Figure 2: Learning from the teacher predictions. An example of soft predictions computed by a teacher model trained on the entire data (top), a model trained on $25\%$ of the data (middle), and a student model trained on $25\%$ of the data with KD (bottom), for an evaluation sample of class "Girl" from CIFAR-100. Using KD, the student can better learn close or ambiguous categories by leveraging knowledge captured by the teacher from the full dataset.
  • Figure 3: Highest scoring samples. Top 10 highest scoring samples selected by the 'forgetting' pruning method for CIFAR-100 and SVHN datasets. The labels of the majority of the images are ambiguous due to class complexity or low image quality.
  • Figure 4: Data pruning results with knowledge distillation. Accuracy results across different pruning factors $f$, and various pruning approaches ('forgetting', EL2N, GraNd and random pruning) on the CIFAR-100, SVHN, and CIFAR-10 datasets. We use an equalized weight in the loss (i.e., $\alpha=0.5$). Using KD, significant improvement is achieved across all pruning regimes and all pruning methods. Random pruning outperforms other pruning methods for low pruning factors. For sufficiently high $f$, the accuracy is robust to the choice of the pruning approach in the presence of KD.
  • Figure 5: Data pruning results with KD on ImageNet. Accuracy results across different pruning factors $f$, and various pruning methods on the ImageNet dataset. We use an equalized weight ($\alpha=0.5$) in Eq. \ref{eq:loss}.
  • ...and 5 more figures

Theorems & Definitions (5)

  • Theorem 1
  • Lemma 1
  • proof
  • Theorem 2
  • proof