SCORE: Soft Label Compression-Centric Dataset Condensation via Coding Rate Optimization

Bowen Yuan, Yuxia Fu, Zijian Wang, Yadan Luo, Zi Huang

TL;DR

Dataset condensation for large-scale data suffers from storage overhead of soft labels and limited scalability. SCORE introduces a coding-rate–based min-max objective balancing informativeness, discriminativeness, and compressibility to select realistic, informative samples while enabling strong soft-label compression via RPCA. The method achieves state-of-the-art performance on ImageNet-1K and Tiny-ImageNet under tight storage budgets, with substantial reductions in soft-label storage (e.g., 1.9 GB at 50 IPC) and only modest accuracy losses under high compression. The work provides a scalable, information-theoretic framework for condensing large datasets with robust cross-architecture generalization.
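The storage argument behind low-rank soft-label compression can be illustrated with a small sketch. It uses a plain truncated SVD as a stand-in, not the RPCA procedure the summary mentions, and the matrix sizes (300 views, 100 classes, rank 4) are synthetic values chosen only for illustration. The helper names `compress_low_rank` and `storage_ratio` are hypothetical.

```python
import numpy as np

def compress_low_rank(labels, rank):
    """Truncated SVD factorisation (a stand-in for RPCA, which is not implemented here)."""
    U, s, Vt = np.linalg.svd(labels, full_matrices=False)
    # Fold the singular values into the left factor so only two matrices are stored.
    return U[:, :rank] * s[:rank], Vt[:rank]

def storage_ratio(shape, rank):
    """Entries in the dense matrix divided by entries in the two factors."""
    n, c = shape
    return (n * c) / (rank * (n + c))

rng = np.random.default_rng(0)
# Synthetic, exactly rank-4 "soft label" matrix: 300 augmented views x 100 classes.
labels = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 100))

left, right = compress_low_rank(labels, rank=4)
recon = left @ right
rel_err = np.linalg.norm(labels - recon) / np.linalg.norm(labels)
ratio = storage_ratio(labels.shape, rank=4)
print(f"relative error: {rel_err:.2e}, storage ratio: {ratio:.2f}x")
```

Real soft labels are only approximately low-rank, so the reconstruction error would be nonzero; the point of the sketch is that the factorised form stores `rank * (n + c)` numbers instead of `n * c`.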

Abstract

Dataset Condensation (DC) aims to obtain a condensed dataset such that models trained on it achieve performance comparable to models trained on the full dataset. Recent DC approaches increasingly focus on encoding knowledge into realistic images with soft labeling, because of their scalability to ImageNet-scale datasets and their strong cross-domain generalization. However, this strong performance comes at a substantial storage cost, which can significantly exceed the storage cost of the original dataset. We argue that the three key properties for alleviating this performance-storage dilemma are the informativeness, discriminativeness, and compressibility of the condensed data. To this end, this paper proposes a Soft label compression-centric dataset condensation framework using COding RatE (SCORE). SCORE formulates dataset condensation as a min-max optimization problem that balances the three key properties from an information-theoretic perspective. In particular, we theoretically demonstrate that our coding rate-inspired objective function is submodular, and that its optimization naturally enforces low-rank structure in the soft label set corresponding to each condensed sample. Extensive experiments on large-scale datasets, including ImageNet-1K and Tiny-ImageNet, demonstrate that SCORE outperforms existing methods in most cases. Even with 30× compression of soft labels, performance decreases by only 5.5% and 2.7% for ImageNet-1K with IPC 10 and 50, respectively. Code will be released upon paper acceptance.
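As a rough illustration of how a coding-rate objective can drive sample selection, the sketch below uses the log-determinant form familiar from the rate-distortion literature, 0.5 * logdet(I + alpha * Z^T Z), and picks samples greedily by marginal gain. This is not the SCORE min-max objective, whose exact terms are not given in this summary; the fixed scaling `alpha`, the function names, and the random features are all assumptions made for the example.

```python
import numpy as np

def coding_rate(Z, alpha=1.0):
    """0.5 * logdet(I + alpha * Z^T Z) for a feature matrix Z of shape (n, d)."""
    d = Z.shape[1]
    _, logdet = np.linalg.slogdet(np.eye(d) + alpha * Z.T @ Z)
    return 0.5 * logdet

def greedy_select(Z, k, alpha=1.0):
    """Greedily pick k rows of Z, each time adding the row that raises the coding rate most."""
    selected = []
    remaining = list(range(len(Z)))
    for _ in range(k):
        # Score every candidate by the coding rate of the set it would produce.
        scores = [coding_rate(Z[selected + [i]], alpha) for i in remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
Z = rng.normal(size=(40, 8))                    # hypothetical features: 40 samples, 8 dims
Z /= np.linalg.norm(Z, axis=1, keepdims=True)   # unit-normalise each feature vector
picked = greedy_select(Z, k=5)
print("selected indices:", picked)
print("coding rate of selection:", coding_rate(Z[picked]))
```

Greedy selection is a natural fit when the objective is monotone and submodular, since it then comes with the classical (1 - 1/e) approximation guarantee; whether the paper uses exactly this procedure is not stated here.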

Paper Structure

This paper contains 23 sections, 3 theorems, 25 equations, 9 figures, 5 tables, 1 algorithm.

Key Result

Proposition 1

Figures (9)

  • Figure 1: We compare the image quality and the corresponding model performance produced by four kinds of DC methods: Optimization-based, Generative model-based, Selection-based and Ours. Left: Visualization of different DC methods. Right: Performance comparison over different DC methods.
  • Figure 2: The comparison among various coreset selection methods on ImageWoof with IPC=10.
  • Figure 3: The comparison among soft label compression methods on ImageNet-1K with IPC=10.
  • Figure 4: Parameter sensitivity analysis of $\alpha$ and $\beta$ in the overall selection objective.
  • Figure 5: The performance drop comparison across different methods.
  • ...and 4 more figures

Theorems & Definitions (5)

  • Proposition 1: Fundamental properties of effective DC
  • Lemma 1: Coding rate function is a concave surrogate for the rank function [logdet_heuristics]
  • Lemma 2: Coding rate function is submodular
  • proof
  • proof
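The submodularity claim in Lemma 2 can be sanity-checked numerically. The sketch below assumes the set function f(S) = 0.5 * logdet(I + alpha * Z_S^T Z_S) with a fixed scaling `alpha` (the paper's exact normalisation is not given in this summary) and tests the diminishing-returns inequality f(A ∪ {x}) - f(A) ≥ f(B ∪ {x}) - f(B) for random nested sets A ⊆ B. The features and helper names are hypothetical, and a passing check is evidence, not a proof.

```python
import numpy as np

def coding_rate(Z, alpha=1.0):
    """0.5 * logdet(I + alpha * Z^T Z) for a feature matrix Z of shape (n, d)."""
    d = Z.shape[1]
    _, logdet = np.linalg.slogdet(np.eye(d) + alpha * Z.T @ Z)
    return 0.5 * logdet

def marginal_gain(Z, subset, x, alpha=1.0):
    """Increase in coding rate from adding sample x to the index list `subset`."""
    return coding_rate(Z[subset + [x]], alpha) - coding_rate(Z[subset], alpha)

rng = np.random.default_rng(1)
Z = rng.normal(size=(30, 6))   # hypothetical features: 30 samples, 6 dims

violations = 0
for _ in range(200):
    perm = rng.permutation(30).tolist()
    B = perm[:10]   # larger set
    A = B[:4]       # nested subset of B
    x = perm[10]    # candidate outside both sets
    # Diminishing returns: the gain on the smaller set should not be smaller.
    if marginal_gain(Z, A, x) < marginal_gain(Z, B, x) - 1e-9:
        violations += 1

print("diminishing-returns violations:", violations)
```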