Teddy: Efficient Large-Scale Dataset Distillation via Taylor-Approximated Matching

Ruonan Yu, Songhua Liu, Jingwen Ye, Xinchao Wang

TL;DR

Dataset distillation (DD) aims to condense large datasets into small synthetic ones while preserving performance, but existing bilevel training pipelines are memory- and compute-heavy. Teddy introduces Taylor-approximated matching, which converts the inner-loop optimization into a first-order objective, and uses a pre-cached pool of weak teachers to avoid retraining a model at every update. The authors provide a theoretical framework linking mainstream DD objectives and demonstrate that gradient/feature-statistics matching suffices under this approximation. Empirically, Teddy achieves state-of-the-art efficiency and accuracy on Tiny-ImageNet and full ImageNet-1K, with substantial runtime reductions and strong cross-architecture generalization.

Abstract

Dataset distillation, or condensation, compresses a large-scale dataset into a much smaller one such that models trained on the synthetic dataset generalize effectively to real data. As defined, this problem relies on a bi-level optimization algorithm: a new model is trained in each iteration within a nested loop, with gradients propagated through an unrolled computation graph. However, this approach incurs high memory and time complexity, which makes it difficult to scale to large datasets such as ImageNet. To address these concerns, this paper introduces Teddy, a Taylor-approximated dataset distillation framework designed to handle large-scale datasets and improve efficiency. On the one hand, backed by theoretical analysis, we propose a memory-efficient approximation derived from Taylor expansion, which transforms the original form, dependent on multi-step gradients, into a first-order one. On the other hand, rather than repeatedly training a new model in each iteration, we show that employing a pre-cached pool of weak models, which can be generated from a single base model, improves both time efficiency and performance, particularly on large-scale datasets. Extensive experiments demonstrate that Teddy attains state-of-the-art efficiency and performance on the Tiny-ImageNet and original-sized ImageNet-1K datasets, surpassing prior methods by up to 12.8% while reducing runtime by 46.6%. Our code will be available at https://github.com/Lexie-YU/Teddy.
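
The first-order idea described in the abstract can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes plain gradient descent on a quadratic "synthetic" loss, a quadratic "real" loss, and hypothetical names (`unrolled_vs_taylor`, `syn_center`, `real_center`). It compares the real-data loss after an unrolled inner loop with a first-order estimate that only accumulates per-step gradient inner products, so no multi-step computation graph is needed.

```python
def loss_quadratic(theta, center):
    """Toy loss 0.5 * ||theta - center||^2 (a stand-in for a training loss)."""
    return 0.5 * sum((t - c) ** 2 for t, c in zip(theta, center))


def grad_quadratic(theta, center):
    """Gradient of the toy loss with respect to theta."""
    return [t - c for t, c in zip(theta, center)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unrolled_vs_taylor(theta0, syn_center, real_center, lr, steps):
    """Return (exact, approx) real-data loss after `steps` updates on the synthetic loss.

    exact  : real loss evaluated at the end of the unrolled trajectory
    approx : initial real loss minus lr * sum of per-step inner products
             <grad_real(theta_t), grad_syn(theta_t)> (first-order Taylor terms)
    """
    theta = list(theta0)
    approx = loss_quadratic(theta, real_center)
    for _ in range(steps):
        g_syn = grad_quadratic(theta, syn_center)
        g_real = grad_quadratic(theta, real_center)
        approx -= lr * dot(g_real, g_syn)  # first-order change of the real loss
        theta = [t - lr * g for t, g in zip(theta, g_syn)]  # inner-loop step
    exact = loss_quadratic(theta, real_center)
    return exact, approx


exact, approx = unrolled_vs_taylor(
    theta0=[0.0, 0.0],
    syn_center=[1.0, 0.5],   # hypothetical optimum of the synthetic loss
    real_center=[1.2, 0.4],  # hypothetical optimum of the real loss
    lr=0.01,
    steps=50,
)
print(f"exact={exact:.4f} approx={approx:.4f} gap={exact - approx:.4f}")
```

In this toy setting the gap between the two values comes only from the dropped second-order terms, which scale with the square of the learning rate. The sketch evaluates the objective only; it does not optimize the synthetic data or use a teacher pool.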

Paper Structure

This paper contains 34 sections, 3 theorems, 22 equations, 7 figures, 7 tables, 1 algorithm.

Key Result

Proposition 1

The meta-learning-based optimization objective can be Taylor-approximated as the sum of the gradient matching of the distilled data and the original data for all steps along the training trajectory of the student model.
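
One way to see why such a statement is plausible is the standard first-order expansion below. This is a sketch under assumed notation, not the paper's exact formulation: $\theta_t$ denotes the student parameters at step $t$, $\eta$ the learning rate, $\mathcal{L}_{\mathcal{S}}$ the loss on the distilled data, and $\mathcal{L}_{\mathcal{T}}$ the loss on the original data.

```latex
% Inner-loop update of the student on the distilled data (assumed plain gradient descent):
\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \mathcal{L}_{\mathcal{S}}(\theta_t)

% First-order Taylor expansion of the original-data loss around \theta_t:
\mathcal{L}_{\mathcal{T}}(\theta_{t+1})
  \approx \mathcal{L}_{\mathcal{T}}(\theta_t)
  - \eta \, \nabla_\theta \mathcal{L}_{\mathcal{T}}(\theta_t)^{\top}
          \nabla_\theta \mathcal{L}_{\mathcal{S}}(\theta_t)

% Telescoping over the T steps of the trajectory:
\mathcal{L}_{\mathcal{T}}(\theta_T)
  \approx \mathcal{L}_{\mathcal{T}}(\theta_0)
  - \eta \sum_{t=0}^{T-1}
      \nabla_\theta \mathcal{L}_{\mathcal{T}}(\theta_t)^{\top}
      \nabla_\theta \mathcal{L}_{\mathcal{S}}(\theta_t)
```

Under these assumptions, minimizing the final original-data loss amounts to maximizing the summed inner products between the distilled-data and original-data gradients at every step, which is a gradient-matching objective along the student's trajectory.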

Figures (7)

  • Figure 1: Illustration of meta-learning-based methods and our method (left), and a comparison of memory and time efficiency (right). Our proposed method is markedly more memory- and time-efficient.
  • Figure 2: (a)(b) Illustration of our proposed method. First, the model pool is generated by the prior- or post-generation strategy. Then, the initialized distilled data is updated via statistic and label matching. Finally, the generated distilled data, with augmentation, is fed into the model pool again to obtain the soft labels. (c) The left figure shows the distance between student and teacher models at different stages. The dotted line represents the distance between a single teacher and a student. The solid line represents the average of the distances within the range. The right figure shows the performance of distilled data generated from teacher models at different stages.
  • Figure 3: (a) Ablation study on the number of ensembled models used to generate synthetic data. (b) Ablation study on the number of ensembled models used to generate soft labels. (c) Ablation study on the size of the model pool on ImageNet-1K with IPC 10 and IPC 50, using the prior-generation strategy. (d) The time and memory requirements of our method compared with the previous SOTA. The size of each point represents the peak GPU memory, and the three points, from left to right, report the evaluation results when ensembling 1, 2, and 3 teacher models to generate the synthetic data. (e) Continual learning on Tiny-ImageNet IPC 50 with a 5-step incremental protocol.
  • Figure 4: Visualization of synthetic data generated by SRe$^2$L and our method. The first row is generated by SRe$^2$L, and the second row is generated by ours with the prior-generation strategy. Here, we choose three classes from ImageNet-1K: Giant Panda, Bulbul, and Airship.
  • Figure 5: Left: training loss of the original DD (dashed line) and our approximated objectives (solid line); Right: the difference of average loss, average accuracy and peak accuracy during training.
  • ...and 2 more figures

Theorems & Definitions (6)

  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Proposition 3
  • proof