Distilling Long-tailed Datasets

Zhenghao Zhao; Haoxuan Wang; Yuzhang Shang; Kai Wang; Yan Yan

TL;DR

This work introduces long-tailed dataset distillation (LTDD), addressing the breakdown of standard dataset distillation when the target data are highly imbalanced. It identifies two root causes: biased gradients that arise when distilling imbalanced data, and suboptimal tail-class guidance from experts trained on that data. To address these, the authors propose Distribution-agnostic Matching (DAM), which avoids directly matching the biased expert trajectories so that tail-class bias is not distilled into the synthetic dataset, and Expert Decoupling (ED), which decouples the backbone and classifier and matches them jointly to improve tail-class guidance and initialize reliable soft labels. Evaluations on CIFAR-10-LT, CIFAR-100-LT, TinyImageNet-LT, and ImageNet-LT show state-of-the-art results, including lossless performance in some settings and strong cross-architecture generalization. The authors describe this as the first effective LTDD method, with robust tail-class performance and practical implications for training efficiency on real-world imbalanced data.

Abstract

Dataset distillation aims to synthesize a small, information-rich dataset from a large one for efficient model training. However, existing dataset distillation methods struggle with long-tailed datasets, which are prevalent in real-world scenarios. By investigating the reasons behind this unexpected result, we identified two main causes: 1) The distillation process on imbalanced datasets develops biased gradients, leading to the synthesis of similarly imbalanced distilled datasets. 2) The experts trained on such datasets perform suboptimally on tail classes, resulting in misguided distillation supervision and poor-quality soft-label initialization. To address these issues, we first propose Distribution-agnostic Matching to avoid directly matching the biased expert trajectories. It reduces the distance between the student and the biased expert trajectories and prevents the tail class bias from being distilled to the synthetic dataset. Moreover, we improve the distillation guidance with Expert Decoupling, which jointly matches the decoupled backbone and classifier to improve the tail class performance and initialize reliable soft labels. This work pioneers the field of long-tailed dataset distillation, marking the first effective effort to distill long-tailed datasets.
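As a point of reference for what "matching expert trajectories" means, the following minimal sketch computes the normalized parameter distance commonly used in trajectory-matching distillation (the MTT-style objective). It illustrates the baseline that is matched directly, not Distribution-agnostic Matching itself. The function name and the flat-list representation of model parameters are assumptions made for illustration.

```python
def trajectory_matching_loss(student_end, expert_start, expert_target):
    """Squared distance from the student's final parameters to the expert's
    target checkpoint, normalized by how far the expert itself moved.

    All three arguments are flat lists of floats (a hypothetical stand-in
    for flattened network weights).
    """
    if not (len(student_end) == len(expert_start) == len(expert_target)):
        raise ValueError("parameter vectors must have the same length")
    numerator = sum((s - t) ** 2 for s, t in zip(student_end, expert_target))
    denominator = sum((a - t) ** 2 for a, t in zip(expert_start, expert_target))
    if denominator == 0.0:
        raise ValueError("expert start and target checkpoints are identical")
    return numerator / denominator


# Toy example: the expert moves from [0, 0] to [1, 2].
expert_start = [0.0, 0.0]
expert_target = [1.0, 2.0]
loss_perfect = trajectory_matching_loss([1.0, 2.0], expert_start, expert_target)
loss_no_progress = trajectory_matching_loss([0.0, 0.0], expert_start, expert_target)
```

A student that lands exactly on the expert checkpoint gets a loss of 0, and one that never leaves the starting point gets a loss of 1. When the expert checkpoints themselves come from long-tailed training, driving this loss down also pulls the student toward the expert's bias, which is the failure mode the abstract describes.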

Paper Structure

This paper contains 20 sections, 6 equations, 8 figures, 10 tables.

Introduction
Long-tailed Dataset Distillation
Problem Formulation
Distribution-agnostic Matching
Improved Guidance with Expert Decoupling
Experiments
Experiment Datasets
Main Results
Ablation Studies
Long-tailed Dataset Distillation Analysis
Related Work
Dataset Distillation
Long-tailed Recognition
Conclusion
Training Details
...and 5 more sections

Figures (8)

Figure 1: Performance comparison on CIFAR-10-LT. Existing dataset distillation methods degrade when applied to imbalanced datasets, and the degradation worsens as the imbalance factor increases, whereas our method performs significantly better across imbalance settings.
Figure 2: Relationship of classifier weights and class-wise accuracy. We reveal that classifiers generated by existing dataset distillation methods often exhibit imbalanced distributions, resulting in poor performance on tail classes. In contrast, our method produces balanced classifiers, thereby enhancing overall accuracy.
Figure 3: Effect of a biased expert. (a) An expert trained on an imbalanced dataset leads to increasingly imbalanced weight gradients over classes. (b) Existing dataset distillation methods such as MTT (Cazenavette et al., 2022) and DATM ignore the distribution gap between $\mathcal{S}_{t}$ and $\mathcal{D}$. This causes the student gradient imbalance to increase at each step. If we match such trajectories, the synthetic dataset is updated by increasingly imbalanced gradients over classes, and a model trained on this synthetic dataset is highly biased. (c) With Distribution-agnostic Matching, the increasingly imbalanced gradients over classes are re-weighted, so that the student model is updated with balanced gradients.
Figure 4: Comparison of the internal loop in traditional DD methods and in ours. The expert model $\mathcal{M}_{\mathcal{D}}$ is trained on a long-tailed dataset, leading to biased weights. Left: In traditional DD methods, directly matching these weights causes large information loss in the tail classes, and backpropagation carries this imbalance from the expert trajectories into the synthetic dataset. Right: With Distribution-agnostic Matching, the gradients are obtained via $\mathcal{L}^c$, which revises the student weights used for matching in the internal loop. This reduces the distance between the student weights and the expert trajectory, limiting the influence of imbalance on the synthetic dataset.
Figure 5: Soft-label initialization on CIFAR-10-LT. We visualize the average predictive confidence of experts on different classes. For traditional DD, the soft labels are initialized well on head classes but predicted poorly on tail classes, leading to insufficient supervision.
...and 3 more figures
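To make the "imbalance factor" in Figure 1 and the re-weighting idea in Figure 3(c) concrete, the sketch below builds per-class sample counts for a long-tailed split and derives inverse-frequency class weights from them. It rests on two assumptions not stated in the source: the exponential decay profile, which is a common convention for constructing -LT benchmarks, and generic inverse-frequency weighting, which stands in for the idea of re-weighting and is not the paper's $\mathcal{L}^c$. All function names are hypothetical.

```python
def long_tailed_counts(n_max, num_classes, imbalance_factor):
    """Per-class sample counts decaying exponentially from n_max (head class)
    down to roughly n_max / imbalance_factor (tail class)."""
    if num_classes < 2:
        raise ValueError("need at least two classes")
    counts = []
    for i in range(num_classes):
        frac = i / (num_classes - 1)  # 0 for the head class, 1 for the tail class
        counts.append(max(1, round(n_max * imbalance_factor ** (-frac))))
    return counts


def inverse_frequency_weights(counts):
    """Class weights proportional to 1 / count, rescaled to have mean 1."""
    inverse = [1.0 / n for n in counts]
    mean_inverse = sum(inverse) / len(inverse)
    return [v / mean_inverse for v in inverse]


# CIFAR-10 has 5000 training images per class; an imbalance factor of 100
# is used here purely as an illustrative setting.
counts = long_tailed_counts(n_max=5000, num_classes=10, imbalance_factor=100)
weights = inverse_frequency_weights(counts)
```

With these settings the head class keeps all 5000 images while the tail class keeps 50, and the tail class receives a weight 100 times larger than the head class. Scaling each class's loss (and therefore its gradient) by such weights is one simple way to counteract the head-class dominance that the figure captions attribute to biased experts.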
