Dataset Distillation via Committee Voting

Jiacheng Cui, Zhaoyi Li, Xiaochen Ma, Xinyue Bi, Yaxin Luo, Zhiqiang Shen

TL;DR

CV-DD introduces Committee Voting for Dataset Distillation, a multi-model, prior-performance guided framework that synthesizes high-quality synthetic data by weighting model contributions according to prior performance and generating batch-specific soft labels. It combines a strong baseline with a diverse backbone committee and a data-driven voting strategy to improve generalization and reduce overfitting, achieving state-of-the-art results across CIFAR, Tiny-ImageNet, and ImageNet-1K under varying IPC settings. The approach also emphasizes efficiency, cross-architecture robustness, and BN-statistic-aware soft labeling, enabling reliable performance even in data-limited scenarios. Overall, CV-DD offers a scalable, robust solution for distilling datasets that preserves essential information while reducing computational cost.

Abstract

Dataset distillation aims to synthesize a smaller, representative dataset that preserves the essential properties of the original data, enabling efficient model training with reduced computational resources. Prior work has primarily focused on improving the alignment or matching process between original and synthetic data, or on enhancing the efficiency of distilling large datasets. In this work, we introduce Committee Voting for Dataset Distillation (CV-DD), a novel and orthogonal approach that leverages the collective wisdom of multiple models or experts to create high-quality distilled datasets. We start by showing how to establish a strong baseline that already achieves state-of-the-art accuracy through leveraging recent advancements and thoughtful adjustments in model design and optimization processes. By integrating distributions and predictions from a committee of models while generating high-quality soft labels, our method captures a wider spectrum of data features, reduces model-specific biases and the adverse effects of distribution shifts, leading to significant improvements in generalization. This voting-based strategy not only promotes diversity and robustness within the distilled dataset but also significantly reduces overfitting, resulting in improved performance on post-eval tasks. Extensive experiments across various datasets and IPCs (images per class) demonstrate that Committee Voting leads to more reliable and adaptable distilled data compared to single/multi-model distillation methods, demonstrating its potential for efficient and accurate dataset distillation. Code is available at: https://github.com/Jiacheng8/CV-DD.
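
The abstract does not spell out how committee members are weighted. As a minimal sketch, assuming each member's weight is a softmax over its prior (for example, validation) accuracy and that the committee objective is the weighted sum of per-model losses, the idea might look as follows. The function names, the temperature parameter, and the softmax form are illustrative assumptions, not the authors' implementation.

```python
import math


def committee_weights(prior_accuracies, temperature=1.0):
    """Turn prior accuracies into voting weights that sum to 1.

    ASSUMPTION: a softmax over accuracies is used here purely for
    illustration; the paper may define the weights differently.
    """
    scaled = [acc / temperature for acc in prior_accuracies]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def committee_loss(per_model_losses, weights):
    """Weighted sum of the losses reported by each committee member."""
    return sum(w * loss for w, loss in zip(weights, per_model_losses))


if __name__ == "__main__":
    # Hypothetical prior accuracies for three backbones.
    weights = committee_weights([0.70, 0.75, 0.80], temperature=0.1)
    print(weights)
    print(committee_loss([1.2, 1.0, 0.9], weights))
```

A lower temperature concentrates the vote on the strongest member, while a higher one moves the committee toward uniform averaging.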

Paper Structure (31 sections, 9 equations, 22 figures, 19 tables, 1 algorithm)

Figures (22)

  • Figure 1: The top panel illustrates the motivation for our committee voting-based dataset distillation, highlighting its ability to reduce the bias of any individual model's knowledge. The bottom panel shows the performance improvement over the previous state-of-the-art method, RDED [RDED_2024].
  • Figure 2: Overview of CV-DD. The process begins with Data Initialization, which generates synthetic data from the original data distribution. In the Voting Strategy stage, a committee of models collectively decides on the distributions for the synthetic data; the voting mechanism considers each model's prior performance and computes a weighted gradient update from each model's distribution and prediction. Batch-Specific Soft Labeling generates soft labels tailored to small batch sizes by embedding batch-norm statistics from each synthetic data batch. Finally, a Smoothed Learning Rate strategy is applied during post-training, adjusting the learning rate dynamically with a cosine schedule to stabilize training.
  • Figure 3: Performance comparison between the original SRe$^2$L and the enhanced SRe$^2$L++ baseline across five datasets with IPC=10 during the post-evaluation stage.
  • Figure 4: Average cosine similarity (lower is better) between feature embeddings of pairwise samples within the same class on ImageNet-1K with IPC=10.
  • Figure 5: Feature-level statistical discrepancies between synthetic data generated by SRe$^2$L++ and the training data on ImageNet-1K, evaluated across different batches in a pre-trained ResNet18 model.
  • ...and 17 more figures
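
The diversity metric named in the Figure 4 caption, average cosine similarity between feature embeddings of sample pairs within a class, can be sketched as below. This is an illustrative pure-Python version under the assumption that embeddings are plain lists of floats; the helper names are hypothetical, and the paper's own evaluation code (feature extractor, normalization) may differ.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all unordered pairs in one class.

    Lower values indicate more diverse samples within the class.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical 2-D embeddings for three synthetic images of one class.
    class_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    print(mean_pairwise_cosine(class_embeddings))
```

With IPC=10, each class contributes 45 unordered pairs, so the metric stays cheap to compute per class.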