Self-supervised Dataset Distillation: A Good Compression Is All You Need

Muxin Zhou; Zeyuan Yin; Shitong Shao; Zhiqiang Shen

TL;DR

The paper addresses a bottleneck in dataset distillation: large recovery models obtained by supervised pretraining fail to retain useful information because their BatchNorm statistics flatten out. The contrast with self-supervised models is formalized as $H(\Theta_{ssl})>H(\Theta_{sl})$. The paper proposes Self-supervised Compression for Dataset Distillation (SC-DD), a decoupled approach that freezes an SSL-pretrained backbone and adds a BN-matching objective $\mathcal{L}_{BN}$ alongside $\mathcal{L}_{CE}$, plus an imbalanced BN statistic distribution matching term to amplify informative signals. Across CIFAR-100, Tiny-ImageNet, and ImageNet-1K, SC-DD achieves state-of-the-art results and scales positively with larger recovery models, outperforming SRe$^2$L, MTT, TESLA, and others (e.g., CIFAR-100 IPC 50: $53.4\%$, Tiny-ImageNet IPC 50: $45.9\%$, ImageNet-1K IPC 50: $60.9\%$ with appropriate architectures). The method also enables applications such as data-free pruning and demonstrates the value of SSL-based representations for large-scale dataset distillation. Code is available for reproducibility.
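To make the combined objective concrete, here is a minimal sketch of a synthesis loss of the form $\mathcal{L}_{CE} + \alpha\,\mathcal{L}_{BN}$. It rests on several assumptions that the summary above does not confirm: the BN term is taken to be the L2 distance between the batch statistics of the synthetic data and the stored running statistics of each BN layer, features are simplified to one scalar per sample and channel, and the function names and the weight `alpha_bn` are hypothetical. The imbalanced distribution matching term is not modeled.

```python
import math


def channel_stats(features):
    """Per-channel mean and biased variance of a batch.

    `features[sample][channel]` is a float. Real BN layers also average over
    spatial positions; that is omitted here to keep the sketch small.
    """
    n = len(features)
    num_channels = len(features[0])
    means = [sum(row[c] for row in features) / n for c in range(num_channels)]
    variances = [
        sum((row[c] - means[c]) ** 2 for row in features) / n
        for c in range(num_channels)
    ]
    return means, variances


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def bn_matching_loss(layer_features, running_means, running_vars):
    """Assumed form of L_BN: summed L2 gaps between batch and running stats."""
    loss = 0.0
    for feats, rm, rv in zip(layer_features, running_means, running_vars):
        mu, var = channel_stats(feats)
        loss += l2_distance(mu, rm) + l2_distance(var, rv)
    return loss


def cross_entropy(logits, target):
    """Cross-entropy for one sample, using a numerically stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def synthesis_loss(layer_features, running_means, running_vars,
                   logits, target, alpha_bn=1.0):
    """L_CE + alpha_bn * L_BN; alpha_bn is a hypothetical weighting."""
    bn_term = bn_matching_loss(layer_features, running_means, running_vars)
    return cross_entropy(logits, target) + alpha_bn * bn_term
```

In this simplified form, the BN term vanishes exactly when the synthetic batch reproduces the stored statistics, so the size of the gap is what drives the gradient signal on the synthesized images.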

Abstract

Dataset distillation aims to compress information from a large-scale original dataset into a new compact dataset while preserving as much of the original data's informational essence as possible. Previous studies have predominantly concentrated on aligning the intermediate statistics between the original and distilled data, such as weight trajectory, features, gradient, BatchNorm, etc. In this work, we consider addressing this task through the new lens of model informativeness in the compression stage, i.e., pretraining on the original dataset. We observe that with the prior state-of-the-art SRe$^2$L, as model sizes increase, it becomes increasingly challenging for supervised pretrained models to recover learned information during data synthesis, as the channel-wise mean and variance inside the model are flattening and becoming less informative. We further notice that larger variances in BN statistics from self-supervised models enable larger loss signals to update the recovered data by gradients, yielding more informativeness during synthesis. Building on this observation, we introduce SC-DD, a simple yet effective Self-supervised Compression framework for Dataset Distillation that facilitates diverse information compression and recovery compared to traditional supervised learning schemes, and further reaps the potential of large pretrained models with enhanced capabilities. Extensive experiments are conducted on CIFAR-100, Tiny-ImageNet and ImageNet-1K datasets to demonstrate the superiority of our proposed approach. When employing larger models, the proposed SC-DD outperforms all previous state-of-the-art supervised dataset distillation methods, such as SRe$^2$L, MTT, TESLA, DC, CAFE, etc., by large margins under the same recovery and post-training budgets. Code is available at https://github.com/VILA-Lab/SRe2L/tree/main/SCDD/.

Paper Structure (21 sections, 1 theorem, 19 equations, 13 figures, 20 tables)

Introduction
Approach
Understanding Dataset Compression
Model-Data Alignment
Imbalanced BN Statistic Distribution Matching
A Simple DD Framework via Self-supervised Pretraining
Post-training for Validation
Experiments
Datasets and Implementation Details
Comparison with State-of-the-art Approaches
Ablation
Analysis
Application: Data-free Pruning
Related Work
Conclusion
...and 6 more sections

Key Result

Theorem 1

The Batch Normalization statistical parameters $\Theta$ (mean $\mu$ and variance $\sigma^2$) derived from self-supervised contrastive learning fluctuate more than those derived from supervised learning. They therefore carry higher entropy and are more informative for the recovery (image synthesis) stage of dataset distillation, i.e., $H(\Theta_{ssl}) > H(\Theta_{sl})$.
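One way to see why stronger fluctuation goes with higher entropy is to fit a Gaussian to the per-channel statistics: its differential entropy, $\frac{1}{2}\ln(2\pi e\,\sigma^2)$, grows monotonically with the variance. The sketch below illustrates this under that Gaussian assumption, which is not stated in the source, and the paper's own entropy definition may differ. The channel values are made up for illustration and are not measurements from any model.

```python
import math
from statistics import pvariance


def gaussian_entropy(values):
    """Differential entropy (in nats) of a Gaussian fitted to `values`.

    Assumption: the per-channel statistics are modeled as Gaussian, so the
    entropy depends only on their population variance.
    """
    var = pvariance(values)
    return 0.5 * math.log(2 * math.pi * math.e * var)


# Hypothetical per-channel BN means (illustrative values only).
flat_means = [0.10, 0.12, 0.09, 0.11]    # stand-in for flattened statistics
spread_means = [-0.80, 0.50, 1.30, -0.20]  # stand-in for fluctuating statistics

h_flat = gaussian_entropy(flat_means)
h_spread = gaussian_entropy(spread_means)
print(f"flat: {h_flat:.3f} nats, spread: {h_spread:.3f} nats")
```

Under this proxy, comparing the across-channel variances of two models is enough to order their entropies, which matches the way the theorem ties fluctuation to informativeness.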

Figures (13)

Figure 1: Example distilled images from SRe$^2$L (Yin et al., 2023) and our approach: 64$\times$64 Tiny-ImageNet (top two rows) and 224$\times$224 ImageNet-1K (bottom two rows). All our synthetic data is generated from self-supervised pretrained models, which yields more realistic images with better semantic alignment and details. Moreover, training a conventional deep model with our distilled images results in a model that achieves test accuracy on the original validation data markedly superior to previous dataset distillation methods. More visualization results are available at https://drive.google.com/file/d/1uQgGPx36WkBH-qh6iTpx2H2Dn86qXRAR/view?usp=sharing.
Figure 2: Top-1 accuracy of SRe$^2$L (Yin et al., 2023) and our approach on full ImageNet-1K with recovery model scales from small to large. The recovery budget is 1$k$ iterations. Each curve shows the post-training validation accuracy on ResNet-{18, 50, 101} and RegNet-x-8gf.
Figure 3: Overview of our learning paradigm. The top-left subfigure is the paradigm of supervised pretraining with an end-to-end training scheme for both the backbone network and final alignment classifier. The bottom-left subfigure is the paradigm of our proposed procedure for dataset distillation: a backbone model is first pretrained using a self-supervised objective, then a linear probing layer is adjusted to align the pretraining distribution with the target dataset distribution. We do not fine-tune the backbone during the alignment phase to preserve the better intermediate distributions of mean and variance in batch normalization layers (illustrated in the middle yellow line chart of the figure). The bottom-middle subfigure is the data synthesis procedure and the right subfigure is the visualization of distilled images.
Figure 4: Illustration of the mean (left) and variance (right) of the first BN layer in the residual block for self-supervised MoCo-v3-ResNet-50 and supervised ResNet-{18, 50, 101}. In each subfigure, the x-axis represents the channel index and the y-axis represents the corresponding value. The table inside each subfigure reports the variance across all channels, which reflects the fluctuation of statistics in the BN layer.
Figure 5: Loss trajectories during data synthesis. Left subfigure illustrates the BN loss term and right subfigure illustrates the CE loss term. The backbone is ResNet-18 for both self-supervised and supervised training schemes.
...and 8 more figures
