GIFT: Unlocking Full Potential of Labels in Distilled Dataset at Near-zero Cost

Xinyi Shang; Peng Sun; Tao Lin

TL;DR

GIFT addresses the sensitivity of dataset distillation (DD) to the choice of loss function when training with soft labels. It is a universal, plug-and-play approach that refines the soft labels and trains with a cosine similarity loss. The method is motivated by a mutual-information bound and InfoNCE-based reasoning, and it uses smoothed hard labels to strengthen inter-class dispersion. Empirically, GIFT consistently improves state-of-the-art DD methods on Tiny-ImageNet and ImageNet-1K, including with large networks, at near-zero additional cost, and it improves cross-architecture and cross-optimizer generalization. The work offers a practical, scalable technique with applications to continual learning and large-scale distillation, backed by theoretical and empirical support for cosine-based label utilization.

Abstract

Recent advancements in dataset distillation have demonstrated the significant benefits of employing soft labels generated by pre-trained teacher models. In this paper, we introduce a novel perspective by emphasizing the full utilization of labels. We first conduct a comprehensive comparison of various loss functions for soft label utilization in dataset distillation, revealing that the model trained on the synthetic dataset exhibits high sensitivity to the choice of loss function for soft label utilization. This finding highlights the necessity of a universal loss function for training models on synthetic datasets. Building on these insights, we introduce an extremely simple yet surprisingly effective plug-and-play approach, GIFT, which encompasses soft label refinement and a cosine similarity-based loss function to efficiently leverage full label information. Extensive experiments indicate that GIFT consistently enhances state-of-the-art dataset distillation methods across various dataset scales, without incurring additional computational costs. Importantly, GIFT significantly enhances cross-optimizer generalization, an area previously overlooked. For instance, on ImageNet-1K with IPC = 10, GIFT enhances the state-of-the-art method RDED by 30.8% in cross-optimizer generalization. Our code is available at https://github.com/LINs-lab/GIFT.
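The two ingredients named in the abstract, soft label refinement and a cosine similarity-based loss, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation (see the linked repository for that). The convex mixing form, the direction of the weight `gamma`, the smoothing strength `eps`, and the helper names `smooth_hard_label`, `refine_label`, and `cosine_loss` are all assumptions made for illustration.

```python
import math

def smooth_hard_label(target, num_classes, eps=0.1):
    """Label-smoothed one-hot vector: 1 - eps on the target class,
    eps spread evenly over the remaining classes (assumed smoothing form)."""
    off = eps / (num_classes - 1)
    return [1.0 - eps if c == target else off for c in range(num_classes)]

def refine_label(soft, target, gamma=0.1, eps=0.1):
    """Convex mix of the teacher soft label and the smoothed hard label.
    Which term gamma weights is an assumption, not taken from the paper."""
    hard = smooth_hard_label(target, len(soft), eps)
    return [gamma * s + (1.0 - gamma) * h for s, h in zip(soft, hard)]

def cosine_loss(pred, label):
    """1 - cosine similarity between a student output vector and a label vector."""
    dot = sum(p * q for p, q in zip(pred, label))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(q * q for q in label))
    return 1.0 - dot / norm

# Hypothetical 3-class example: the teacher favours class 0.
teacher_soft = [0.7, 0.2, 0.1]
refined = refine_label(teacher_soft, target=0, gamma=0.5)

loss_aligned = cosine_loss([0.8, 0.15, 0.05], refined)  # student agrees with the label
loss_wrong = cosine_loss([0.1, 0.8, 0.1], refined)      # student favours class 1
```

Because both inputs to the mix sum to one, the refined label is still a probability vector, and the cosine loss depends only on the direction of the student output, not its scale. That scale invariance is one plausible reason a cosine-based loss might be less sensitive to training configuration, though the sketch itself does not demonstrate this.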

Paper Structure (68 sections, 1 theorem, 21 equations, 13 figures, 19 tables)

Introduction
Related Work
Optimization-based Soft Labels.
Teacher model-based soft labels.
Motivation
Preliminary
Dataset Distillation.
Cross-Optimizer Generalization.
Are Loss Functions Pulling the Strings in Synthetic Dataset Performance?
Method
Label Refinement.
Mutual information bounded loss function.
Experiments
Experiment Setup
Datasets and Networks.
...and 53 more sections

Key Result

Theorem 1

The $\mathcal{V}$-information $I_\mathcal{V}(X,Y)$ is upper bounded by a function involving the cosine similarity between the positive pair $(\mathbf{x}_i, y_i)$, the expected cosine similarity between the anchor $\mathbf{x}_i$ and negative samples $y_j$, and the number of negative samples $K$. Specifically, … where $\tau$ denotes the temperature parameter and $\mathcal{L}_{\textup{InfoNCE}}$ is the InfoNCE loss (van den Oord et al., 2018).
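To make the quantities in the theorem concrete, the sketch below computes the standard InfoNCE loss for a single anchor from cosine similarities, a temperature $\tau$, and $K$ negative samples. It uses the textbook InfoNCE form, which may differ from the exact expression in the paper's bound. Representing the anchor and the labels as plain vectors, and the helper names `cosine` and `info_nce`, are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(anchor, positive, negatives, tau=0.1):
    """Standard InfoNCE for one anchor: the negative log-softmax of the
    positive pair's similarity among K + 1 candidates (1 positive, K negatives)."""
    logits = [cosine(anchor, positive) / tau]
    logits += [cosine(anchor, n) / tau for n in negatives]
    m = max(logits)  # subtract the max before exponentiating for stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]

# Hypothetical 2-D example with K = 2 negatives.
x = [1.0, 0.0]
loss_good = info_nce(x, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])  # positive is close to the anchor
loss_bad = info_nce(x, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])   # a negative is closer than the positive
```

The loss falls as the positive pair's cosine similarity rises relative to the negatives. When every candidate is equally similar to the anchor, it equals $\log(K+1)$, which shows how the number of negatives enters the expression.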

Figures (13)

Figure 1: Top-1 accuracy on various synthetic datasets produced by SOTA dataset distillation methods, across loss functions, on Tiny-ImageNet and ImageNet-1K when IPC=10. One highlighted value marks the result of the loss function used by the distillation method itself (e.g., SRe$^2$L (Yin et al., 2023) uses KL divergence (Hinton et al., 2015)); the other marks the result of GIFT, and ($\uparrow$) denotes the improvement over the dataset distillation method. GIFT significantly enhances the dataset distillation methods.
Figure 2: Top-1 accuracy (%) for the state-of-the-art dataset distillation methods on various synthetic datasets when IPC=10 on ResNet-18 with different $\gamma$.
Figure 3: 5-step and 10-step class-incremental learning on Tiny-ImageNet on ResNet-18.
Figure 4: Top-1 accuracy on various synthetic datasets via the SOTA dataset distillation methods across loss functions on Tiny-ImageNet when IPC=1.
Figure 5: Top-1 accuracy on various synthetic datasets via the SOTA dataset distillation methods across loss functions on ImageNet-1K when IPC=1.
...and 8 more figures

Theorems & Definitions (3)

Definition 1: Cross-optimizer Generalization
Theorem 1
proof
