Multisize Dataset Condensation

Yang He, Lingao Xiao, Joey Tianyi Zhou, Ivor Tsang

TL;DR

This paper addresses flexible, on-device dataset condensation by introducing Multisize Dataset Condensation (MDC), which compresses $N$ condensation processes into a single run. It adds an adaptive subset loss, built around Most Learnable Subset (MLS) selection, to mitigate the subset degradation problem across all target subset sizes. MDC shows consistent improvements across ConvNet, ResNet, and DenseNet on SVHN, CIFAR-10/100, and ImageNet-10, with notable gains at small images-per-class (IPC) targets and substantial reductions in training time and storage. The results indicate practical benefits for on-device learning scenarios that need multiple condensed sizes without running multiple condensation passes.
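The structure of an adaptive subset loss can be illustrated with a toy sketch: a base loss on the full synthetic set plus an extra loss on one nested subset. Everything here is an assumption made for illustration. The 1-D mean-matching loss stands in for the basic condensation loss, the function names are hypothetical, and `pick_subset_size` is a placeholder rule (largest toy loss), not the paper's MLS criterion.

```python
# Toy sketch only: hypothetical names, not the authors' implementation.

def mean(values):
    return sum(values) / len(values)

def matching_loss(synthetic, real):
    """Squared gap between the synthetic mean and the real mean (toy stand-in)."""
    return (mean(synthetic) - mean(real)) ** 2

def pick_subset_size(synthetic, real):
    """Placeholder rule: the proper prefix with the largest toy loss.
    This is NOT the paper's MLS criterion, only an illustrative stand-in."""
    sizes = range(1, len(synthetic))  # proper prefixes only
    return max(sizes, key=lambda k: matching_loss(synthetic[:k], real))

def adaptive_subset_loss(synthetic, real):
    """Base loss on the full set plus a loss on the selected prefix subset."""
    k = pick_subset_size(synthetic, real)
    base = matching_loss(synthetic, real)
    subset = matching_loss(synthetic[:k], real)
    return base + subset, k

real = [0.0, 1.0, 2.0, 3.0]        # 1-D "features" of real data, mean 1.5
synthetic = [3.0, 1.0, 0.5, 1.5]   # full synthetic set, mean 1.5
total, k = adaptive_subset_loss(synthetic, real)
print(total, k)  # prints 2.25 1
```

In this toy case the full synthetic set matches the real mean exactly (base loss 0), yet its one-element prefix does not. That gap is a small-scale picture of subset degradation, and the added subset term is what penalizes it.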

Abstract

While dataset condensation effectively enhances training efficiency, its application in on-device scenarios brings unique challenges. 1) Due to the fluctuating computational resources of these devices, there is a demand for a flexible dataset size that diverges from a predefined size. 2) The limited computational power on devices often prevents additional condensation operations. These two challenges connect to the "subset degradation problem" in traditional dataset condensation: a subset from a larger condensed dataset is often unrepresentative compared to directly condensing the whole dataset to that smaller size. In this paper, we propose Multisize Dataset Condensation (MDC) by compressing N condensation processes into a single condensation process to obtain datasets with multiple sizes. Specifically, we introduce an "adaptive subset loss" on top of the basic condensation loss to mitigate the "subset degradation problem". Our MDC method offers several benefits: 1) no additional condensation process is required; 2) storage requirements are reduced by reusing condensed images. Experiments validate our findings on networks including ConvNet, ResNet and DenseNet, and datasets including SVHN, CIFAR-10, CIFAR-100 and ImageNet. For example, we achieved 5.22%-6.40% average accuracy gains on condensing CIFAR-10 to ten images per class. Code is available at: https://github.com/he-y/Multisize-Dataset-Condensation.
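The storage benefit of reusing condensed images can be made concrete with a small arithmetic sketch. It assumes, for illustration only, that the target sizes are 1 through N images per class and that each smaller dataset is a prefix of the largest one; the abstract states only that condensed images are reused. The function names are hypothetical.

```python
# Illustrative arithmetic under the stated assumptions; names are hypothetical.

def images_stored_separately(n):
    """Images per class when sizes 1..n are each condensed and stored on their own."""
    return sum(range(1, n + 1))

def images_stored_nested(n):
    """Images per class when every smaller size is a subset of the size-n set."""
    return n

def nested_subsets(condensed):
    """All nested prefixes of one condensed set, one per target size."""
    return [condensed[:k] for k in range(1, len(condensed) + 1)]

n = 10
print(images_stored_separately(n), images_stored_nested(n))  # prints 55 10
```

Under these assumptions, ten separately condensed datasets hold 1 + 2 + ... + 10 = 55 images per class, while a single nested set holds 10.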

Paper Structure (27 sections, 10 equations, 15 figures, 12 tables, 1 algorithm)

This paper contains 27 sections, 10 equations, 15 figures, 12 tables, 1 algorithm.

Figures (15)

  • Figure 1: Condensing datasets to multiple sizes requires $\boldsymbol{N}$ separate traditional condensation processes (left) but just a single MDC process (right).
  • Figure 2: Three different baselines for multi-size condensation.
  • Figure 3: Explanation of Our MDC.
  • Figure 4: Cross-architecture performance of the proposed method. CIFAR-10, IPC$_{10}$.
  • Figure 5: MLS and frozen subsets visualization. CIFAR-10, IPC$_{10}$.
  • ...and 10 more figures