Dataset Distillation by Automatic Training Trajectories

Dai Liu; Jindong Gu; Hu Cao; Carsten Trinitis; Martin Schulz

TL;DR

This work identifies the Accumulated Mismatching Problem (AMP) as a key drawback of fixed-length, long-range dataset distillation and introduces Automatic Training Trajectories (ATT), which adaptively choose the trajectory length by comparing the synthetic parameters with the expert target at every candidate step. By selecting the step $N_{opt}$ that minimizes the distance $e_t = \lVert\theta'_{i,t}-\theta^*_{i,N_T}\rVert^2$, ATT avoids the accumulation of matching errors and improves generalization to unseen architectures. Empirical results on CIFAR-10/100, Tiny ImageNet, and ImageNet subsets show that ATT outperforms prior baselines, especially in cross-architecture (CA) generalization, while remaining competitive in storage and computation with existing long-range methods. The approach also yields more stable performance under parameter variations. The central quantities are the synthetic trajectory length $N_S$, the expert step $N_T$, and the matching error $e_t$; ATT offers a dynamic alternative to the traditional fixed trajectory length scheme.
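The minimum-distance selection described above can be illustrated with a minimal sketch. It assumes the synthetic parameters after each candidate step are available as flat NumPy vectors; the function name `select_step` and the toy values are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def select_step(synthetic_params, expert_target):
    """Return (n_opt, errors), where errors[t-1] = ||theta'_t - theta*||^2.

    synthetic_params: list of parameter vectors, one per candidate step (assumed layout).
    expert_target:    expert parameter vector to match.
    """
    errors = np.array([np.sum((theta - expert_target) ** 2)
                       for theta in synthetic_params])
    n_opt = int(np.argmin(errors)) + 1  # candidate steps are 1-indexed
    return n_opt, errors

# Toy trajectory (hypothetical): parameters after each of N_S = 4 steps.
trajectory = [k * np.array([1.0, 0.5]) for k in range(1, 5)]
target = np.array([2.0, 1.0])

n_opt, errors = select_step(trajectory, target)
ftl_error = errors[-1]         # a fixed-length scheme matches the last step
att_error = errors[n_opt - 1]  # the adaptive scheme matches the closest step
```

In this toy case the trajectory overshoots the target after the second step, so matching at the final step would incur a larger error than matching at the closest step.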

Abstract

Dataset Distillation is used to create a concise yet informative synthetic dataset that can replace the original dataset for training purposes. Some leading methods in this domain prioritize long-range matching, which unrolls training trajectories for a fixed number of steps ($N_S$) on the synthetic dataset and aligns them with various expert training trajectories. However, traditional long-range matching methods suffer from an overfitting-like problem: the fixed step size $N_S$ forces the synthetic dataset to conform, in a distorted way, to the expert training trajectories it has seen, resulting in a loss of generality, especially with respect to trajectories from unencountered architectures. We refer to this as the Accumulated Mismatching Problem (AMP) and propose a new approach, Automatic Training Trajectories (ATT), which dynamically and adaptively adjusts the trajectory length $N_S$ to address the AMP. Our method outperforms existing methods, particularly in cross-architecture tests. Moreover, owing to its adaptive nature, it exhibits enhanced stability in the face of parameter variations.

Paper Structure

This paper contains 22 sections, 6 equations, 10 figures, 5 tables, and 1 algorithm.

Introduction
Related Work
Dataset Distillation
Sample Selection
Method
LDD
Accumulated Mismatching Problem (AMP)
Automatic Training Trajectories (ATT)
Experiments
Dataset and Experimental Setup
Cross-Architecture Generalization
Benchmark Comparison
Ablation Study on Parameters
Storage and Computation
Conclusions
...and 7 more sections

Figures (10)

Figure 1: The plots show the L1 distance between each network along a training trajectory and the corresponding target, together with the target chosen by each method. The experiments are carried out on CIFAR-10. Existing methods employ Fixed Training Length (FTL), which selects the network at the end of a trajectory, whereas our method ATT dynamically selects the network closest to the target. The left plot shows examples from the beginning iterations, and the right plot shows later iterations. ATT dynamically adjusts the matching target, thus avoiding large matching errors and unwanted stretching of trajectories.
Figure 2: The figure illustrates that Fixed Trajectory Length (FTL) matches all experts with avoidable matching error throughout the distillation process. The figure is generated from experiments conducted on CIFAR-10 with Images Per Class (IPC) set to 1. We count the cases matched with larger matching errors, $|N_S-N_{opt}|\geq\gamma$, in every interval of 50 iterations throughout the distillation process. The number of cases matched with larger errors fluctuates over the entire process. The same can be observed in the mean value of line a. From left to right, we observe that this issue persists across different step sizes $N_S$ for FTL. Notation: #: number; N'th interval: the N'th block of 50 iterations.
Figure 3: The figure displays the core idea of our method ATT in comparison to traditional LDD methods employing FTL. Left: vanilla LDD bypasses all possible predictions and matches the inaccurate prediction. Right: our method ATT adopts an adaptive approach, aligning predictions with expert network parameters using a minimum-distance policy. ATT avoids cases where trajectories are compressed or stretched and prevents the errors resulting from those cases from accumulating in the synthetic dataset at each iteration, thus achieving better distillation performance.
Figure 4: Left: This plot illustrates ATT's step selection during the distillation phase, highlighting its initial preference for smaller steps, which contributes to enhanced stability during parameter tuning. Right: The plot demonstrates ATT's superior stability under varying parameters, showing the number of successful cases when a parameter is scaled by a multiplier. Different multipliers are presented as distinct cases.
Figure 5: The figure illustrates the impact of the tolerance parameter, $\gamma$, on the performance of FTL. As depicted, the number of iterations fluctuates below 50, indicating that this phenomenon is consistently present throughout the entire distillation process. AMP therefore persists even when a certain level of tolerance for mistakes, as represented by $\gamma$, is applied. This finding shows that AMP is robust to varying degrees of error tolerance and underscores the importance of mitigating it in distillation processes.
...and 5 more figures
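The counting described in the Figure 2 caption (cases with $|N_S-N_{opt}|\geq\gamma$ per block of 50 iterations) can be sketched as follows. This is only an illustration: it assumes a per-iteration log of the closest step $N_{opt}$ is available, and the function name `count_mismatches` and the log values are hypothetical.

```python
def count_mismatches(n_opt_per_iter, n_s, gamma, interval=50):
    """Count, per block of `interval` iterations, the cases with |n_s - n_opt| >= gamma."""
    counts = []
    for start in range(0, len(n_opt_per_iter), interval):
        window = n_opt_per_iter[start:start + interval]
        counts.append(sum(1 for n in window if abs(n_s - n) >= gamma))
    return counts

# Hypothetical log of the closest step at each of 100 distillation iterations.
log = [30] * 50 + [80] * 25 + [45] * 25

counts = count_mismatches(log, n_s=50, gamma=10)
```

Raising `gamma` tolerates larger gaps between the fixed length and the closest step, so the counts can only stay the same or decrease.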
