Dataset Distillation via Relative Distribution Matching and Cognitive Heritage

Qianxin Xia; Jiawei Du; Yuhan Zhang; Jielei Wang; Guoming Lu

TL;DR

The paper tackles the high resource cost of dataset distillation for pre-trained self-supervised vision models by replacing batch-level linear gradient matching (LGM) with Statistical Flow Matching (SFM), which aligns synthetic data to a fixed global flow computed once from the statistics of the original data. It further introduces Classifier Inheritance (CI), which reuses the classifier trained on the original dataset at evaluation time through a lightweight linear projector, adding only marginal storage and compute. Across diverse backbones (e.g., CLIP, DINO-v2, EVA-02, MoCo-v3) and datasets (including ImageNet-1k and ImageNet-100), SFM matches or outperforms LGM, and CI yields substantial further gains, sometimes approaching full-dataset training accuracy with as little as one image per class (IPC = 1). The approach uses about 10x less GPU memory and 4x less runtime than LGM, offering practical potential for edge environments and future extensions to object detection and semantic segmentation.

Abstract

Dataset distillation seeks to synthesize a highly compact dataset that achieves performance comparable to the original dataset on downstream tasks. For classification tasks that use pre-trained self-supervised models as backbones, the previous linear gradient matching method optimizes synthetic images by encouraging them to mimic the gradient updates that real images induce on the linear classifier. However, this batch-level formulation requires loading thousands of real images and applying multiple rounds of differentiable augmentations to synthetic images at each distillation step, leading to substantial computational and memory overhead. In this paper, we introduce statistical flow matching, a stable and efficient supervised learning framework that optimizes synthetic images by aligning constant statistical flows from target class centers to non-target class centers in the original data. Our approach loads raw statistics only once and performs a single augmentation pass on the synthetic data, achieving performance comparable to or better than state-of-the-art methods with 10x lower GPU memory usage and 4x shorter runtime. Furthermore, we propose a classifier inheritance strategy that reuses the classifier trained on the original dataset for inference, requiring only an extremely lightweight linear projector and marginal storage while achieving substantial performance gains.
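To make the idea of a "statistical flow" concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that the flow for class c is the set of vectors pointing from the center of class c to every other class center, and that a synthetic feature is scored by the cosine similarity between its own vectors to the non-target centers and that fixed flow. The function names (`class_centers`, `statistical_flow`, `flow_alignment_loss`) and the cosine-based loss are illustrative assumptions; the paper's exact objective is not given in this summary. Only the forward computation is shown; optimizing synthetic images would require an autodiff framework.

```python
import numpy as np

def class_centers(features, labels, num_classes):
    """Mean feature vector per class, shape (C, D). Computed once from real data."""
    centers = np.zeros((num_classes, features.shape[1]))
    for c in range(num_classes):
        centers[c] = features[labels == c].mean(axis=0)
    return centers

def statistical_flow(centers):
    """flow[c, k] = centers[k] - centers[c], shape (C, C, D)."""
    return centers[None, :, :] - centers[:, None, :]

def flow_alignment_loss(syn_features, syn_labels, centers, eps=1e-8):
    """Assumed loss: 1 - mean cosine similarity between each synthetic
    sample's flow to the non-target centers and the fixed statistical flow."""
    target_flow = statistical_flow(centers)[syn_labels]          # (N, C, D)
    syn_flow = centers[None, :, :] - syn_features[:, None, :]    # (N, C, D)
    dot = (target_flow * syn_flow).sum(axis=-1)
    norms = (np.linalg.norm(target_flow, axis=-1)
             * np.linalg.norm(syn_flow, axis=-1))
    cos = dot / (norms + eps)                                    # (N, C)
    # Exclude the target class itself, whose statistical flow is the zero vector.
    mask = np.ones_like(cos, dtype=bool)
    mask[np.arange(len(syn_labels)), syn_labels] = False
    return 1.0 - cos[mask].mean()

# Toy usage with random "features" standing in for backbone outputs.
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(60, 8))
real_labels = np.repeat(np.arange(3), 20)
centers = class_centers(real_feats, real_labels, 3)
loss_at_centers = flow_alignment_loss(centers, np.arange(3), centers)
loss_shuffled = flow_alignment_loss(centers[[1, 2, 0]], np.arange(3), centers)
```

Because the centers are computed once and then held fixed, the real images never need to be reloaded during distillation, which is the property the abstract credits for the memory and runtime savings.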

Paper Structure (11 sections, 2 theorems, 13 equations, 6 figures, 5 tables)

Introduction
Related Work
Method
Rethinking Linear Gradient Matching
Statistical Flow Matching
Classifier Inheritance
Experiments
Distillation Performance on Specific Models
Generalization Performance Across Models
Ablation Analysis
Conclusion

Key Result

Theorem 3.1

Consider multiclass classification with the softmax function. Let the number of classes be $C$, and let the weight vectors $\bm{W}_{c}$ for each class $c$ be independently initialized from a zero-mean Gaussian distribution, $\bm{W}_{c}\sim \mathcal{N}(0, \sigma_{c}^{2})$. For sample $i$, the ...

Figures (6)

Figure 1: Comparison between LGM and our method (SFM and CI) in terms of distillation time, GPU memory usage, and validation accuracy. The number following each method denotes the augmentations per batch. The distillation model is EVA-02 and the generalized models are CLIP, DINO-v2 and MoCo-v3. Our SFM substantially reduces resource consumption while achieving state-of-the-art validation performance. Incorporating CI at evaluation raises performance further.
Figure 2: The framework of our method. Stage I: Given a pre-trained self-supervised vision model, we extract the global statistical center summarized from the original dataset and retain the classifier trained on that dataset, which together serve as the supervision signal for distillation and the inference head during evaluation. Stage II: We optimize the synthetic images by matching their flow through the distillation model to the statistical flow. Stage III: We only train a lightweight linear projector following the evaluation model to align the distillation model’s knowledge representation on synthetic images. Subsequently, the inherited golden classifier is used for inference and prediction.
Figure 3: (Left) A comparison between soft label training (solid line) and our CI training (dashed line). (Right) The effect of applying multi-round differentiable augmentation to our synthetic data.
Figure 4: (Left) A comparison between images synthesized by LGM and our SFM. LGM produces images with noticeable artifacts, whereas ours are notably clearer. (Right) A comparison between images synthesized by TCDD and NCDD. TCDD effectively captures discriminative features of the target class, while NCDD deviates entirely from the target-class representation.
Figure 5: The flow on the ImageNet-100. We perform PCA for dimensionality reduction, visualizing 20 classes. Compared to LGM, our synthetic flow aligns more closely with the global statistical flow. Cosine similarity between the synthetic and statistical flow for all samples is 88.2 and 95.6 in the two cases, respectively.
...and 1 more figure
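Stage III of Figure 2 (training only a linear projector, then predicting with the inherited classifier) can be sketched roughly as follows. This is an illustration under stated assumptions rather than the paper's procedure: the projector is fit here by ordinary least squares so that the evaluation model's features on the synthetic images map onto the distillation model's features, and the names `fit_projector`, `predict_with_inherited_classifier`, `W`, and `b` are hypothetical. The summary does not state the actual training objective for the projector.

```python
import numpy as np

def fit_projector(eval_feats, distill_feats):
    """Least-squares linear map P with eval_feats @ P ~= distill_feats.
    Stand-in (assumption) for however the projector is actually trained."""
    P, *_ = np.linalg.lstsq(eval_feats, distill_feats, rcond=None)
    return P  # shape (D_eval, D_distill)

def predict_with_inherited_classifier(eval_feats, P, W, b):
    """Project evaluation-model features into the distillation model's
    feature space, then apply the classifier (W, b) inherited unchanged
    from training on the original dataset."""
    logits = (eval_feats @ P) @ W + b
    return logits.argmax(axis=1)

# Toy usage: features of the two models are related by an unknown linear map A.
rng = np.random.default_rng(0)
eval_feats = rng.normal(size=(50, 6))      # evaluation model on synthetic images
A = rng.normal(size=(6, 4))
distill_feats = eval_feats @ A             # distillation model on the same images
W = rng.normal(size=(4, 3))                # inherited classifier weights
b = np.zeros(3)

P = fit_projector(eval_feats, distill_feats)
preds = predict_with_inherited_classifier(eval_feats, P, W, b)
reference = (distill_feats @ W + b).argmax(axis=1)
```

Only `P` is learned at evaluation time in this sketch; the classifier itself is stored once and reused, which matches the caption's description of the classifier being inherited rather than retrained.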

Theorems & Definitions (2)

Theorem 3.1: Exchangeability
Theorem 3.2: LogNormal Distribution
