Beyond Random: Automatic Inner-loop Optimization in Dataset Distillation

Muquan Li, Hang Gou, Dongyang Zhang, Shuang Liang, Xiurui Xie, Deqiang Ouyang, Ke Qin

TL;DR

This work tackles the inefficiency of fixed/random inner-loop truncation in dataset distillation by introducing Automatic Truncated Backpropagation Through Time (AT-BPTT), which aligns truncation with gradient dynamics across training stages. The framework combines stage-aware dynamic truncation, gradient-variation-driven adaptive windowing, and a low-rank Hessian approximation to accelerate inner-loop updates, with an additional patch-wise semantic preservation module to extend effectiveness to high-resolution data. Empirical results on CIFAR-10/100, Tiny-ImageNet, and ImageNet-1K show state-of-the-art distillation performance (average improvement of $+6.16\%$ over RaT-BPTT) and substantial computational savings (up to $3.9\times$ speedup and $63\%$ memory reduction), with ablations validating each component. The approach offers a general, scalable solution for efficient bilevel optimization in dataset distillation and related inner-loop problems, with potential extensions to recurrent architectures and federated learning.

Abstract

The growing demand for efficient deep learning has positioned dataset distillation as a pivotal technique for compressing training datasets while preserving model performance. However, existing inner-loop optimization methods for dataset distillation typically rely on random truncation strategies, which lack flexibility and often yield suboptimal results. In this work, we observe that neural networks exhibit distinct learning dynamics across different training stages (early, middle, and late), making random truncation ineffective. To address this limitation, we propose Automatic Truncated Backpropagation Through Time (AT-BPTT), a novel framework that dynamically adapts both truncation positions and window sizes according to intrinsic gradient behavior. AT-BPTT introduces three key components: (1) a probabilistic mechanism for stage-aware timestep selection, (2) an adaptive window sizing strategy based on gradient variation, and (3) a low-rank Hessian approximation to reduce computational overhead. Extensive experiments on CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet-1K show that AT-BPTT achieves state-of-the-art performance, improving accuracy by an average of 6.16% over baseline methods. Moreover, our approach accelerates inner-loop optimization by 3.9x while reducing memory cost by 63%.
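To make the first two components more concrete, the following is a minimal Python sketch of stage-aware truncation sampling and gradient-variation-based window sizing. It is an illustration under stated assumptions, not the authors' implementation: the function names, the equal-thirds stage split, the `phase_probs` table, and the rule that a larger relative gradient variation widens the window are all hypothetical choices made for this example. The low-rank Hessian approximation is not covered.

```python
import random


def training_stage(epoch, total_epochs):
    """Map an outer-loop epoch to 'early', 'middle', or 'late'.

    Assumption: the three stages are equal thirds of training.
    """
    frac = epoch / total_epochs
    if frac < 1 / 3:
        return "early"
    if frac < 2 / 3:
        return "middle"
    return "late"


def sample_truncation_start(stage, num_steps, window, phase_probs, rng):
    """Sample the first timestep of the backpropagation window.

    phase_probs[stage] is the probability of drawing the start from the
    preliminary (first) half of the inner loop; otherwise it is drawn from
    the post (second) half. The probabilities are placeholders.
    """
    last_start = max(0, num_steps - window)  # keep the window in range
    half = num_steps // 2
    if rng.random() < phase_probs[stage]:
        lo, hi = 0, max(0, min(half - 1, last_start))
    else:
        lo, hi = min(half, last_start), last_start
    return rng.randint(lo, hi)  # inclusive on both ends


def adaptive_window(grad_norms, base_window, min_window, max_window):
    """Scale the window size by the relative gradient variation.

    Hypothetical rule: variation is the mean absolute change between
    consecutive gradient norms divided by the mean norm, and a larger
    variation yields a wider window.
    """
    if len(grad_norms) < 2:
        return max(min_window, min(max_window, base_window))
    diffs = [abs(b - a) for a, b in zip(grad_norms, grad_norms[1:])]
    mean_norm = sum(grad_norms) / len(grad_norms)
    variation = (sum(diffs) / len(diffs)) / (mean_norm + 1e-12)
    window = round(base_window * (1.0 + variation))
    return max(min_window, min(max_window, window))


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder preferences; not values reported in the paper.
    probs = {"early": 0.8, "middle": 0.5, "late": 0.2}
    stage = training_stage(epoch=50, total_epochs=600)
    window = adaptive_window([1.0, 1.4, 0.9, 1.1], 40, 10, 100)
    start = sample_truncation_start(stage, 200, window, probs, rng)
    print(stage, window, start)
```

In a full distillation loop, the inner optimization would be unrolled for `num_steps` steps, with gradients with respect to the distilled data backpropagated only through the steps from `start` to `start + window`.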

Paper Structure

This paper contains 30 sections, 22 equations, 8 figures, 11 tables, 1 algorithm.

Figures (8)

  • Figure 1: Hypothesis verification for the influence of truncation strategy and window size. Panels (a)(b)(c) show experiments in which preliminary or post truncation positions are applied in the early, middle, and late training stages, respectively; panels (d)(e)(f) show experiments in which the window size is varied after fixing the truncation position. For example, Early-Preliminary in (a) means that timesteps are randomly selected from the preliminary phase (0-100) during the early training stage (epochs 0-200).
  • Figure 2: Average magnitudes of the gradient and of the gradient variation at each timestep during training. The timesteps are divided roughly evenly into preliminary and post phases.
  • Figure 3: Overall framework of the proposed AT-BPTT. The distilled data flows through the patch-wise semantic preservation module in the inner-loop optimization. The dynamic truncation position and adaptive window size then jointly optimize the inner-loop training dynamics. The low-rank Hessian approximation is used to reduce computational cost.
  • Figure 4: Comparison of performance, GPU memory usage, and speedup between the SOTA DD methods and our AT-BPTT.
  • Figure 5: Ablation study for the stage transition threshold. The left matrix shows the effect of $X$ and $M$ on accuracy, and the right matrix shows the effect of $Y$ and $N$. Darker squares indicate higher accuracy for the corresponding pair of values on the horizontal and vertical axes.
  • ...and 3 more figures