Towards Stable and Storage-efficient Dataset Distillation: Matching Convexified Trajectory

Wenliang Zhong, Haoyu Tang, Qinghai Zheng, Mingzhu Xu, Yupeng Hu, Liqiang Nie

TL;DR

This work addresses the inefficiencies of Matching Training Trajectories (MTT) in dataset distillation by introducing Matching Convexified Trajectory (MCT). MCT replaces the non-convex expert training path with a convex, linearly interpolated trajectory motivated by the linearized dynamics of the Neural Tangent Kernel (NTK). This trajectory supports continuous sampling during distillation and is cheaper to store than the full sequence of expert checkpoints. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show that MCT converges faster, trains more stably, and needs less storage than MTT and other baselines. The approach promises more scalable and data-efficient distillation, particularly for larger models where trajectory storage becomes prohibitive.

Abstract

The rapid evolution of deep learning and large language models has led to an exponential growth in the demand for training data, prompting the development of Dataset Distillation methods to address the challenges of managing large datasets. Among these, Matching Training Trajectories (MTT) has been a prominent approach, which replicates the training trajectory of an expert network on real data with a synthetic dataset. However, our investigation found that this method suffers from three significant limitations: 1. Instability of expert trajectory generated by Stochastic Gradient Descent (SGD); 2. Low convergence speed of the distillation process; 3. High storage consumption of the expert trajectory. To address these issues, we offer a new perspective on understanding the essence of Dataset Distillation and MTT through a simple transformation of the objective function, and introduce a novel method called Matching Convexified Trajectory (MCT), which aims to provide better guidance for the student trajectory. MCT leverages insights from the linearized dynamics of Neural Tangent Kernel methods to create a convex combination of expert trajectories, guiding the student network to converge rapidly and stably. This trajectory is not only easier to store, but also enables a continuous sampling strategy during distillation, ensuring thorough learning and fitting of the entire expert trajectory. Comprehensive experiments across three public datasets validate the superiority of MCT over traditional MTT methods.
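
The idea of a linearly interpolated expert trajectory with continuous sampling can be illustrated with a small sketch. The abstract does not give the exact construction, so this assumes the simplest reading: the convexified trajectory is the straight line between an initial and a final expert weight vector, and a matching segment is drawn from a continuous position on that line. The names `convexified_waypoint`, `sample_segment`, `beta`, and `step` are hypothetical and do not come from the paper.

```python
import random


def convexified_waypoint(theta_start, theta_end, beta):
    """Return (1 - beta) * theta_start + beta * theta_end, element-wise.

    Assumption: weights are flat lists of floats and the trajectory is a
    straight line between the two endpoints.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return [(1.0 - beta) * s + beta * e for s, e in zip(theta_start, theta_end)]


def sample_segment(theta_start, theta_end, step, rng=random):
    """Draw a continuous start position and return (beta, start, target).

    Under this assumption any beta in [0, 1 - step] is a valid start,
    rather than only a fixed set of stored checkpoints.
    """
    beta = rng.uniform(0.0, 1.0 - step)
    start = convexified_waypoint(theta_start, theta_end, beta)
    target = convexified_waypoint(theta_start, theta_end, min(1.0, beta + step))
    return beta, start, target


# Toy endpoints standing in for flattened expert weights (illustrative values).
theta_start = [0.0, 2.0, -1.0]
theta_end = [1.0, 0.0, 3.0]
midpoint = convexified_waypoint(theta_start, theta_end, 0.5)
```

Under this assumption only the two endpoints need to be stored, since every intermediate waypoint can be recomputed on demand. That is one way to read the storage saving the abstract describes.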

Paper Structure (17 sections, 12 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: (a) PCA projection of all waypoint models in the expert trajectory, where the z-axis shows $(1-\text{validation accuracy})$; (b) the number of iterations required for convergence of the MCT and MTT methods during distillation. Convergence is declared once the difference between the accuracy at an iteration and the maximum accuracy is less than $\epsilon = 2\%$.
  • Figure 2: An illustration of the proposed MCT method. The left panel shows a schematic of the landscape in model parameter space; the right panel shows the validation accuracy of waypoint models extracted from the expert trajectories of the MTT method and our MCT method. In the left panel, the original trajectory $\tau_\text{mtt}$ oscillates constantly, so $\vec{V}^{(t)}_{\mathcal{T}}$ keeps changing, which produces the fluctuating expert-model accuracy seen in the right panel. In contrast, the trajectory $\tau_\text{conv}$ of our MCT method is very stable and provides a consistent guidance direction, which leads to the steady improvement of the expert model shown in the right panel.
  • Figure 3: (a) and (b): Convergence comparisons of the distillation process on CIFAR-10 and CIFAR-100, where the star marker denotes the convergence point. (c): Storage comparisons on the three datasets.
  • Figure 4: Effects of Continuous Sampling over iterations with different expert trajectory numbers.
  • Figure 5: Visualization of synthetic dataset.
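
The convergence criterion stated in the Figure 1 caption can be sketched as a small helper. This is one interpretation, not the authors' evaluation code: it assumes accuracies are recorded once per distillation iteration as fractions in [0, 1], and that the convergence point is the first iteration whose accuracy is within $\epsilon$ of the maximum. The function name `convergence_iteration` and the sample curve are hypothetical.

```python
def convergence_iteration(accuracies, eps=0.02):
    """Return the index of the first iteration within eps of the best accuracy.

    Assumption: `accuracies` holds one value per iteration, as fractions,
    so eps=0.02 corresponds to the 2% threshold in the caption.
    """
    if not accuracies:
        raise ValueError("accuracies must be non-empty")
    best = max(accuracies)
    for i, acc in enumerate(accuracies):
        if best - acc < eps:
            return i


# Illustrative accuracy curve (made-up numbers, not results from the paper).
curve = [0.30, 0.50, 0.61, 0.64, 0.65, 0.645]
converged_at = convergence_iteration(curve)
```

On this toy curve the maximum is 0.65, and the first value within 0.02 of it is 0.64 at index 3.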