Heavy Labels Out! Dataset Distillation with Label Space Lightening

Ruonan Yu, Songhua Liu, Zigeng Chen, Jingwen Ye, Xinchao Wang

TL;DR

This work tackles the heavy-label bottleneck in large-scale dataset distillation by proposing HeLlO, a label-lightening framework that replaces stored soft labels with an online, CLIP-informed image-to-label projector. It introduces a LoRA-like low-rank knowledge transfer and a text-guided initialization to efficiently adapt a foundation-model–based projector to target datasets, while an image-level update tightens the alignment between original and distilled label spaces. Synthetic data are initialized from representative image patches and subsequently updated to minimize information loss, enabling high-quality label generation without large storage overhead. Empirically, HeLlO matches or surpasses state-of-the-art large-scale distillation methods on ImageNet-100/1K using only about $0.003\%$ of the original label storage, and demonstrates strong cross-architecture generalization and continual-learning performance.

Abstract

Dataset distillation or condensation aims to condense a large-scale training dataset into a much smaller synthetic one such that the training performance of distilled and original sets on neural networks is similar. Although the number of training samples can be reduced substantially, current state-of-the-art methods rely heavily on an enormous number of soft labels to achieve satisfactory performance. As a result, the required storage can be comparable even to that of the original datasets, especially for large-scale ones. To solve this problem, instead of storing these heavy labels, we propose a novel label-lightening framework termed HeLlO, which aims at effective image-to-label projectors with which synthetic labels can be generated online directly from synthetic images. Specifically, to construct such projectors, we leverage prior knowledge in open-source foundation models, e.g., CLIP, and introduce a LoRA-like fine-tuning strategy to mitigate the gap between pre-trained and target distributions, so that the original models for soft-label generation can be distilled into a group of low-rank matrices. Moreover, an effective image optimization method is proposed to further mitigate the potential error between the original and distilled label generators. Extensive experiments demonstrate that with only about 0.003% of the original storage required for a complete set of soft labels, we achieve comparable performance to current state-of-the-art dataset distillation methods on large-scale datasets. Our code will be available.
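
To make the "group of low-rank matrices" idea concrete, here is a minimal NumPy sketch of a generic LoRA-style update: a frozen base weight plus a trainable low-rank correction, where only the two small factors need to be stored. It illustrates the general technique, not the authors' implementation; the names (`W_base`, `A`, `B`, `project`) and all sizes are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only (not taken from the paper).
d_in, d_out, rank = 512, 100, 4

# Frozen pre-trained weight; stands in for one layer of a foundation model.
W_base = rng.standard_normal((d_out, d_in)).astype(np.float32)

# Trainable low-rank factors. B starts at zero so the adapted layer
# initially reproduces the frozen one, as in standard LoRA.
A = (0.01 * rng.standard_normal((rank, d_in))).astype(np.float32)
B = np.zeros((d_out, rank), dtype=np.float32)


def project(x, W_base, A, B):
    """Apply the frozen weight plus the low-rank correction B @ A."""
    return x @ W_base.T + (x @ A.T) @ B.T


x = rng.standard_normal((8, d_in)).astype(np.float32)
logits = project(x, W_base, A, B)

# Only A and B would need to be saved alongside the public base model.
full_params = W_base.size
stored_params = A.size + B.size
print(f"full layer: {full_params} values, low-rank factors: {stored_params} values")
```

In this toy setting the stored factors hold `rank * (d_in + d_out)` values instead of `d_in * d_out`, which is where the storage saving of a low-rank scheme comes from.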

Paper Structure

This paper contains 20 sections, 1 theorem, 7 equations, 2 figures, 4 tables, 1 algorithm.

Key Result

Proposition 1

A linear transformation initialized with the text embeddings is equivalent to the pre-trained zero-shot classifier.
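
The intuition behind this statement can be checked numerically. The sketch below assumes CLIP-style zero-shot classification, i.e. scaled cosine similarity between a normalized image feature and normalized per-class text embeddings, and compares it with a bias-free linear layer whose weight is set to those text embeddings. Random vectors stand in for real CLIP features, and `logit_scale`, `LinearHead`, and the sizes are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim = 10, 64  # hypothetical sizes


def l2_normalize(v):
    """Scale each row to unit Euclidean norm."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


# Random stand-ins for class text embeddings and image features.
text_emb = l2_normalize(rng.standard_normal((num_classes, dim)))
image_feat = l2_normalize(rng.standard_normal((5, dim)))
logit_scale = 100.0  # assumed temperature factor


def zero_shot_logits(image_feat, text_emb, scale):
    """Scaled cosine similarity between each image and each class text."""
    return scale * (image_feat @ text_emb.T)


class LinearHead:
    """Bias-free linear layer: logits = x @ W^T."""

    def __init__(self, weight):
        self.weight = weight.copy()

    def __call__(self, x):
        return x @ self.weight.T


# Initialize the linear head from the (scaled) text embeddings.
head = LinearHead(logit_scale * text_emb)

same_logits = np.allclose(head(image_feat),
                          zero_shot_logits(image_feat, text_emb, logit_scale))
print("linear head matches zero-shot logits:", same_logits)
```

Because the two computations are the same matrix product, a head initialized this way starts from the zero-shot predictions and can then be fine-tuned toward the target dataset.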

Figures (2)

  • Figure 1: Soft-label generation in current state-of-the-art large-scale dataset distillation (left) and our proposed online lightening image-to-label projector framework (right). In current state-of-the-art methods, soft labels are generated for every augmented image in each downstream training epoch, and all of them are stored. In our method, open-source foundation models serve as the base models and stay fixed throughout training, while a LoRA-like knowledge transfer method narrows the gap between the original label space and the target one. Only the low-rank matrices need to be stored, which significantly reduces storage costs.
  • Figure 2: Continual-learning results for the 5-step (left) and 10-step (right) settings. All experiments use IPC 10 on ImageNet-100.

Theorems & Definitions (2)

  • Proposition 1
  • proof