Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images

Lingao Xiao, Songhua Liu, Yang He, Xinchao Wang

TL;DR

The paper tackles inefficiencies and inconsistent benchmarking in large-scale dataset compression by proposing a unified evaluation for dataset distillation and pruning. It exposes how soft-label-based methods can inflate gains and introduces the Prune, Combine, and Augment (PCA) framework, which relies on hard labels and image data alone. PCA uses pruning-guided image selection, image combination, and scaling-law-aware augmentation to achieve strong performance at extreme compression while avoiding soft-label storage. Through extensive experiments on ImageNet-1K and across model architectures, the authors demonstrate PCA’s competitiveness and argue for a shift back toward image-centric, hard-label evaluation to improve reproducibility and practicality.

Abstract

Dataset distillation and dataset pruning are two prominent techniques for compressing datasets to improve computational and storage efficiency. Despite their overlapping objectives, these approaches are rarely compared directly. Even within each field, the evaluation protocols are inconsistent across various methods, which complicates fair comparisons and hinders reproducibility. Considering these limitations, we introduce in this paper a benchmark that equitably evaluates methodologies across both distillation and pruning literatures. Notably, our benchmark reveals that in the mainstream dataset distillation setting for large-scale datasets, which heavily relies on soft labels from pre-trained models, even randomly selected subsets can achieve surprisingly competitive performance. This finding suggests that an overemphasis on soft labels may be diverting attention from the intrinsic value of the image data, while also imposing additional burdens in terms of generation, storage, and application. To address these issues, we propose a new framework for dataset compression, termed Prune, Combine, and Augment (PCA), which focuses on leveraging image data exclusively, relies solely on hard labels for evaluation, and achieves state-of-the-art performance in this setup. By shifting the emphasis back to the images, our benchmark and PCA framework pave the way for more balanced and accessible techniques in dataset compression research. Our code is available at: https://github.com/ArmandXiao/Rethinking-Dataset-Compression
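
The random baseline mentioned in the abstract can be pictured with a small sketch. The code below is an illustration only, not taken from the authors' repository: the function name `random_ipc_subset` and the toy labels are hypothetical, and it assumes the subset is drawn per class with a fixed images-per-class (IPC) budget and keeps the original hard labels.

```python
import random
from collections import defaultdict


def random_ipc_subset(labels, ipc, seed=0):
    """Return sorted indices of a random subset with at most `ipc` images per class.

    Hypothetical helper for illustration; `labels` is a list of hard class labels.
    """
    rng = random.Random(seed)

    # Group sample indices by their hard label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for label in sorted(by_class):
        members = by_class[label]
        k = min(ipc, len(members))  # a class may hold fewer than `ipc` images
        chosen.extend(rng.sample(members, k))
    return sorted(chosen)


# Toy usage: 3 classes with 5 images each, keep 2 images per class.
toy_labels = [0] * 5 + [1] * 5 + [2] * 5
subset = random_ipc_subset(toy_labels, ipc=2)
```

Because only indices and hard labels are kept, such a subset needs no per-image soft-label storage, which is the burden the abstract points to.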

Paper Structure

This paper contains 46 sections, 5 theorems, 35 equations, 19 figures, 26 tables.

Key Result

Proposition 4.1

Let $\mathcal{D} = \{ x_i \}_{i=1}^N$ be a dataset of images, and let $P_\theta$ be a probabilistic model parameterized by $\theta$. Lowering $\mathrm{NLL}(\mathcal{D}; \theta)$ through a selective cropping operation $\mathcal{C}$ \cite{sun2023diversity}, resulting in a new dataset $\mathcal{D}' = \mathcal{

Figures (19)

  • Figure 1: Benchmarking SOTA methods using hard labels. "DD (Noise)" and "DD (Real)" denote dataset distillation with noise and real images, respectively. Many methods struggle to outperform the random baseline, and methods utilizing more original images generally achieve better performance. Evaluation uses ResNet-18 on ImageNet-1K. Detailed data is provided in Table \ref{tab:benchmark-SOTA-hard}.
  • Figure 2: Benchmarking SOTA methods using soft labels. Many methods struggle to outperform the random baseline, particularly at large IPCs. Evaluation uses ResNet-18 on ImageNet-1K. Detailed data is provided in Table \ref{tab:benchmark-SOTA-soft}.
  • Figure 3: Entropy analysis of different datasets with IPC=10. Images are randomly sampled from the corresponding dataset for visualization. The classifier used for entropy analysis is the pretrained EfficientNet-B0 \cite{tan2019efficientnet}.
  • Figure 4: Patch Shuffling vs. Patch Extraction.
  • Figure 5: Randomly sampled PCA images.
  • ...and 14 more figures
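
The entropy analysis named in the Figure 3 caption can be illustrated with a minimal sketch. It assumes the quantity is the Shannon entropy of a classifier's softmax output for one image; the helper names and toy logits are hypothetical, and the pretrained EfficientNet-B0 that would produce real logits is not loaded here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw classifier scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution for one image."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Toy logits: a maximally uncertain prediction and a confident one.
uniform = prediction_entropy([0.0, 0.0, 0.0, 0.0])
confident = prediction_entropy([10.0, 0.0, 0.0, 0.0])
```

Under this reading, images the classifier finds ambiguous score high and images it recognizes easily score low.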

Theorems & Definitions (13)

  • Proposition 4.1: proof in Appendix \ref{proof:prop}
  • Theorem 4.2: proof in Appendix \ref{proof:thm}
  • Definition 1.1: Negative Log-Likelihood (NLL)
  • Remark 1.2
  • Definition 1.3: Entropy \cite{shannon1948mathematical}
  • Lemma 1.5: Effect of Selective Cropping on NLL
  • proof
  • Lemma 1.6: Correlation of NLL & Entropy
  • proof
  • proof: Proof of Proposition \ref{prop:cropping_vs_performance}
  • ...and 3 more
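
The listing names Definitions 1.1 and 1.3 without stating them. For orientation, the standard textbook forms are given below in the notation of Proposition 4.1. This is an assumption about what the paper's definitions look like: its versions may differ in normalization (for example a $1/N$ factor) or in logarithm base, and the class count $K$ and probability vector $p$ are symbols introduced here for illustration.

$$
\mathrm{NLL}(\mathcal{D}; \theta) = -\sum_{i=1}^{N} \log P_\theta(x_i),
\qquad
H(p) = -\sum_{k=1}^{K} p_k \log p_k .
$$

Here $p = (p_1, \dots, p_K)$ is a probability distribution over $K$ classes, such as a classifier's predicted distribution for a single image.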