Towards Principled Dataset Distillation: A Spectral Distribution Perspective

Ruixi Wu; Shaobo Wang; Jiahuan Chen; Zhiyuan Liu; Yicun Yang; Zhaorun Chen; Zekai Li; Kaixin Li; Xinming Wang; Hongzhu Yi; Kai Wang; Linfeng Zhang

TL;DR

The paper proposes Class-Aware Spectral Distribution Matching (CSDM), which reformulates distribution alignment via the spectrum of a well-behaved kernel function, yielding the Spectral Distribution Distance (SDD). CSDM then exploits the unified form of SDD to perform an amplitude-phase decomposition that adaptively prioritizes realism in tail classes.

Abstract

Dataset distillation (DD) aims to compress large-scale datasets into compact synthetic counterparts for efficient model training. However, existing DD methods exhibit substantial performance degradation on long-tailed datasets. We identify two fundamental challenges: heuristic design choices for the distribution discrepancy measure and uniform treatment of imbalanced classes. To address these limitations, we propose Class-Aware Spectral Distribution Matching (CSDM), which reformulates distribution alignment via the spectrum of a well-behaved kernel function. This technique maps the original samples into frequency space, resulting in the Spectral Distribution Distance (SDD). To mitigate class imbalance, we exploit the unified form of SDD to perform an amplitude-phase decomposition that adaptively prioritizes realism in tail classes. On CIFAR-10-LT with 10 images per class, CSDM achieves a 14.0% improvement over state-of-the-art DD methods, and its performance drops by only 5.7% when the number of images in tail classes decreases from 500 to 25, demonstrating strong stability on long-tailed data.
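To make the idea of comparing distributions in frequency space concrete, here is a minimal NumPy sketch. It is not the paper's SDD: it assumes a Gaussian spectral distribution (which, by Bochner's theorem, corresponds to an RBF kernel with unit bandwidth) and estimates the squared distance as the mean squared gap between empirical characteristic functions. The names `empirical_cf` and `spectral_distance` and all toy data are hypothetical.

```python
import numpy as np

def empirical_cf(x, omegas):
    """Empirical characteristic function of samples x (n, d) at frequencies omegas (m, d)."""
    return np.exp(1j * (x @ omegas.T)).mean(axis=0)  # shape (m,)

def spectral_distance(x, y, omegas):
    """Monte Carlo estimate of E_w |cf_x(w) - cf_y(w)|^2 over the sampled frequencies."""
    gap = empirical_cf(x, omegas) - empirical_cf(y, omegas)
    return float(np.mean(np.abs(gap) ** 2))

rng = np.random.default_rng(0)
dim = 2
# Assumption: frequencies drawn from N(0, I), the spectrum of a unit-bandwidth RBF kernel.
omegas = rng.normal(size=(256, dim))

real = rng.normal(size=(500, dim))               # stand-in for real features
close = rng.normal(size=(500, dim))              # same distribution as `real`
shifted = rng.normal(loc=1.5, size=(500, dim))   # mean-shifted distribution

d_close = spectral_distance(real, close, omegas)
d_far = spectral_distance(real, shifted, omegas)
print(f"same distribution: {d_close:.4f}, shifted distribution: {d_far:.4f}")
```

In this toy setting the distance between two samples of the same distribution stays near zero, while the mean-shifted sample yields a clearly larger value. In a distillation loop the synthetic features would take the place of `close` or `shifted` and be updated to reduce the distance.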

Paper Structure (41 sections, 9 theorems, 96 equations, 5 figures, 5 tables)

Introduction
Related Works
Dataset Distillation with Distribution Matching
Long-tailed Dataset Classification
Preliminaries: Distribution Matching
Misnomer of Mean Square Error in DM.
Methodology
Rethinking Previous Methods from a Kernel Perspective
Class-Aware Spectral Distribution Matching
Efficient Spectral Distribution Metric Design via Universal RKHS
Spectral Distribution Distance
Determining Spectral Distribution of SDD
Class-Aware Decomposition
Experiments
Setup
...and 26 more sections

Key Result

Theorem 4.1

Let $\mathcal{F}$ be the unit ball in a universal RKHS $\mathcal{H}$ defined on a compact metric space $\mathcal{X}$, with kernel $k(\cdot,\cdot)$. Then:
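As a minimal sketch of why the choice of kernel matters for the MMD, the example below compares a linear kernel with an RBF kernel on two toy distributions that share a mean but differ in variance. With a linear kernel the squared MMD reduces to the squared distance between sample means, so it cannot see the difference; for a universal kernel such as the RBF kernel, the MMD is known to vanish only when the two distributions coincide (Gretton et al.). The function names, bandwidth, and data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def mmd2(x, y, kernel):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    return float(kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean())

def linear_kernel(a, b):
    return a @ b.T

def rbf_kernel(a, b, bandwidth=1.0):
    # Pairwise squared Euclidean distances, clipped at zero for numerical safety.
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * (a @ b.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth ** 2))

rng = np.random.default_rng(0)
p = rng.normal(scale=1.0, size=(1000, 2))
q = rng.normal(scale=3.0, size=(1000, 2))
# Center both samples so their empirical means match exactly.
p -= p.mean(axis=0)
q -= q.mean(axis=0)

mmd_linear = mmd2(p, q, linear_kernel)  # only compares means, so it is ~0 here
mmd_rbf = mmd2(p, q, rbf_kernel)        # sensitive to the variance mismatch
print(f"linear kernel: {mmd_linear:.2e}, RBF kernel: {mmd_rbf:.3f}")
```

The linear-kernel value is zero up to floating-point error even though the two samples are visibly different, while the RBF value is clearly positive.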

Figures (5)

Figure 1: Dataset Distillation with Class-Aware Spectral Distribution Matching (CSDM). (a) Kernel Embedding. A feature network extracts features from real and synthetic data. Unlike previous methods using a linear kernel that maps to a suboptimal RKHS, we design a universal kernel that enables accurate measurement of distribution discrepancies. (b) Class-Aware Spectral Distribution Matching. We quantify distribution discrepancies using the Spectral Distribution Distance (SDD). By decomposing SDD into amplitude and phase and applying class-specific weights, CSDM dynamically adapts to class imbalance.
Figure 2: Impact of kernel choice and scale factor on CIFAR-10-LT (IPC=50, imf=200). Under appropriate scale parameters, universal kernels, particularly the RBF kernel, achieve higher accuracy than the linear kernel. The results also show that the scale factor strongly affects distillation performance and that SDD is more tolerant of the parameter choice than MMD.
Figure 3: Ablation on the class-aware weighting for amplitude and phase on CIFAR-10-LT (IPC=10). We set the amplitude weight for head classes to $\alpha_{head}$ and for tail classes to $1 - \alpha_{head}$. The results show that optimal performance is achieved by emphasizing diversity (amplitude) for head classes and realism (phase) for tail classes, with a peak around 0.7–0.9 when the imbalance factor (imf) is large.
Figure 4: Performance of Different Kernel Methods on CIFAR-10-LT with Varying Imbalance Factors.
Figure 5: Effect of the $\bm{\alpha}$ parameter: Images synthesized by CSDM on CIFAR-10 (IPC=10) for $\alpha=0$ and $\alpha=1$. Amplitude-only distillation ($\alpha=1$) yields diverse but less realistic images, whereas phase-only distillation ($\alpha=0$) produces realistic images but suffers from limited diversity.
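The class-aware weighting described in Figures 3 and 5 can be sketched as follows. This is an illustration under stated assumptions, not the paper's formulation: it uses the exact identity $|a-b|^2 = (|a|-|b|)^2 + 2|a||b|(1-\cos\Delta\theta)$ to split the squared gap between empirical characteristic functions into an amplitude term and a phase term, then weights them by $\alpha$ and $1-\alpha$. The helper names, the default `alpha_head=0.8`, and the toy data are hypothetical.

```python
import numpy as np

def empirical_cf(x, omegas):
    """Empirical characteristic function of samples x (n, d) at frequencies omegas (m, d)."""
    return np.exp(1j * (x @ omegas.T)).mean(axis=0)

def amplitude_phase_terms(real, syn, omegas):
    """Split mean |cf_real - cf_syn|^2 into an amplitude term and a phase term."""
    cf_r, cf_s = empirical_cf(real, omegas), empirical_cf(syn, omegas)
    amp_r, amp_s = np.abs(cf_r), np.abs(cf_s)
    amp_term = np.mean((amp_r - amp_s) ** 2)
    phase_gap = np.angle(cf_r) - np.angle(cf_s)
    phase_term = np.mean(2.0 * amp_r * amp_s * (1.0 - np.cos(phase_gap)))
    return float(amp_term), float(phase_term)

def class_aware_loss(real, syn, omegas, is_head, alpha_head=0.8):
    """Amplitude weight is alpha_head for head classes and 1 - alpha_head for tail classes."""
    alpha = alpha_head if is_head else 1.0 - alpha_head
    amp, phase = amplitude_phase_terms(real, syn, omegas)
    return alpha * amp + (1.0 - alpha) * phase

rng = np.random.default_rng(0)
omegas = rng.normal(size=(128, 2))          # assumed Gaussian frequency samples
real = rng.normal(size=(200, 2))            # toy real features for one class
syn = rng.normal(loc=0.5, size=(10, 2))     # toy synthetic features (IPC=10)

head_loss = class_aware_loss(real, syn, omegas, is_head=True)
tail_loss = class_aware_loss(real, syn, omegas, is_head=False)
print(f"head-weighted: {head_loss:.4f}, tail-weighted: {tail_loss:.4f}")
```

With `alpha_head` above 0.5, head classes put more weight on the amplitude term (diversity) and tail classes on the phase term (realism), mirroring the trend reported in Figure 3. Setting the weight to 0.5 for every class recovers half of the undecomposed distance.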

Theorems & Definitions (16)

Theorem 4.1: Property of Universal MMD (Gretton et al., 2007)
Theorem 4.2: Bochner's Theorem (Bochner, 1932)
Theorem 4.3
Definition A.1
Theorem A.2
proof
Theorem A.3
proof
Theorem A.4
proof
...and 6 more
