Distributional Dataset Distillation with Subtask Decomposition

Tian Qin; Zhiwei Deng; David Alvarez-Melis

TL;DR

This paper distills a dataset into a compact distributional representation that is more memory-efficient than prototype-based methods, and proposes Federated Distillation, which decomposes the dataset into subsets, distills them in parallel using sub-task experts, and then re-aggregates them.

Abstract

What does a neural network learn when training from a task-specific dataset? Synthesizing this knowledge is the central idea behind Dataset Distillation, which recent work has shown can be used to compress large datasets into a small set of input-label pairs ($\textit{prototypes}$) that capture essential aspects of the original dataset. In this paper, we make the key observation that existing methods that distill into explicit prototypes are very often suboptimal, incurring unexpected storage costs from distilled labels. In response, we propose $\textit{Distributional Dataset Distillation}$ (D3), which encodes the data using minimal sufficient per-class statistics paired with a decoder, distilling the dataset into a compact distributional representation that is more memory-efficient than prototype-based methods. To scale up the process of learning these representations, we propose $\textit{Federated Distillation}$, which decomposes the dataset into subsets, distills them in parallel using sub-task experts, and then re-aggregates them. We thoroughly evaluate our algorithm on a three-dimensional metric and show that our method achieves state-of-the-art results on TinyImageNet and ImageNet-1K. Specifically, we outperform the prior art by $6.9\%$ on ImageNet-1K under a storage budget of 2 images per class.
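
The pipeline described in the abstract can be illustrated with a toy sketch: split the classes into subtasks, obtain per-class latent statistics for each subtask, merge them, then sample a latent and decode it. This is not the authors' implementation. The round-robin class split, the diagonal Gaussian parameterization (`mean`, `log_var`), and the helpers `distill_subtask` and `toy_decoder` are all hypothetical placeholders chosen for illustration; the actual method learns the statistics and the decoder with sub-task experts.

```python
import math
import random


def split_into_subtasks(class_ids, num_subtasks):
    # Assumption: subtasks are disjoint groups of classes (round-robin split).
    return [class_ids[i::num_subtasks] for i in range(num_subtasks)]


def distill_subtask(classes, latent_dim, rng):
    # Hypothetical placeholder for real distillation with a sub-task expert.
    # Returns per-class statistics: (mean, log_var) of a diagonal Gaussian.
    return {
        c: ([rng.gauss(0.0, 1.0) for _ in range(latent_dim)], [0.0] * latent_dim)
        for c in classes
    }


def aggregate(subtask_results):
    # Re-aggregate: merge the per-class statistics from all subtasks.
    merged = {}
    for result in subtask_results:
        merged.update(result)
    return merged


def sample_latent(mean, log_var, rng):
    # Reparameterized draw: z = mean + std * eps, with std = exp(log_var / 2).
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def toy_decoder(z):
    # Hypothetical stand-in for the learned decoder: squashes each latent
    # coordinate into a "pixel" value between 0 and 1.
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


rng = random.Random(0)
subtasks = split_into_subtasks(list(range(10)), 3)
stats = aggregate(distill_subtask(s, 4, rng) for s in subtasks)

mean, log_var = stats[7]
image = toy_decoder(sample_latent(mean, log_var, rng))
```

In this sketch only the per-class statistics and the shared decoder would need to be stored, and any number of training samples per class can be drawn from them at training time.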

Paper Structure (31 sections, 5 equations, 10 figures, 14 tables)

Introduction
Related Work
Methodology
Three-Dimensional Evaluation
Distilling into distributions
Federated Distillation
Training Objective
Experiments
Quantitative Results
Latent Space Analysis
Federated Distillation
Ablation Study
Distributional Outcome
Loss Term Contribution
Distilled Labels
...and 16 more sections

Figures (10)

Figure 1: Three-dimensional evaluation of methods that scale to ImageNet-1K. Left: Recovery accuracy vs. storage trade-off for our method (D3) and other methods on resized ($64\times64\times3$) ImageNet-1K. Our method achieves SOTA performance in the small memory cost regime. Right: Accuracy vs. downstream task training cost on resized ImageNet-1K.
Figure 2: Illustration of Federated Distillation and Distributional Representation. We decompose large datasets into subtasks and distill each subset into distributions using locally trained experts. Distributions distilled on subtasks generalize well to the full task.
Figure 3: Visualization of distilled means and variations for four classes from ImageNet-1K. First column: typical images for each class, generated by passing the mean to the decoder. Second column onwards: variations obtained by sampling from the corresponding latent distribution.
Figure 4: Visualization of the latent Gaussian space by interpolating priors for four classes from ImageNet-1K. The four corners are the means for each class. We linearly sample the Gaussian space between classes and pass the samples into the decoder to generate interpolated images.
Figure 5: Ablation study on the dual training objective and distributional representation. Both the distributional representation and the dual training objective are essential for the performance of our method.
...and 5 more figures
