SCORE: Soft Label Compression-Centric Dataset Condensation via Coding Rate Optimization

Bowen Yuan; Yuxia Fu; Zijian Wang; Yadan Luo; Zi Huang

TL;DR

Dataset condensation for large-scale data suffers from storage overhead of soft labels and limited scalability. SCORE introduces a coding-rate–based min-max objective balancing informativeness, discriminativeness, and compressibility to select realistic, informative samples while enabling strong soft-label compression via RPCA. The method achieves state-of-the-art performance on ImageNet-1K and Tiny-ImageNet under tight storage budgets, with substantial reductions in soft-label storage (e.g., 1.9 GB at 50 IPC) and only modest accuracy losses under high compression. The work provides a scalable, information-theoretic framework for condensing large datasets with robust cross-architecture generalization.

Abstract

Dataset Condensation (DC) aims to obtain a condensed dataset that allows models trained on it to achieve performance comparable to those trained on the full dataset. Recent DC approaches increasingly focus on encoding knowledge into realistic images with soft labeling, owing to their scalability to ImageNet-scale datasets and strong capability of cross-domain generalization. However, this strong performance comes at a substantial storage cost, which can significantly exceed the storage cost of the original dataset. We argue that the three key properties to alleviate this performance-storage dilemma are informativeness, discriminativeness, and compressibility of the condensed data. Towards this end, this paper proposes a Soft label compression-centric dataset condensation framework using COding RatE (SCORE). SCORE formulates dataset condensation as a min-max optimization problem, which aims to balance the three key properties from an information-theoretic perspective. In particular, we theoretically demonstrate that our coding rate-inspired objective function is submodular, and that its optimization naturally enforces low-rank structure in the soft label set corresponding to each condensed sample. Extensive experiments on large-scale datasets, including ImageNet-1K and Tiny-ImageNet, demonstrate that SCORE outperforms existing methods in most cases. Even with 30× compression of soft labels, performance decreases by only 5.5% and 2.7% for ImageNet-1K with IPC 10 and 50, respectively. Code will be released upon paper acceptance.
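To make the link between coding rate and compressibility concrete, the following is a minimal sketch, not the SCORE objective itself. It assumes the standard coding rate from the rate-distortion literature, R(Z, ε) = ½ log det(I + d/(nε²) Z Zᵀ) for a d × n matrix Z; whether SCORE uses exactly this form is an assumption. The function names `coding_rate` and `low_rank_storage_ratio`, the default `eps`, and the plain rank-r factorization used for the storage arithmetic are all illustrative and are not taken from the paper, whose RPCA-based pipeline may differ.

```python
import numpy as np


def coding_rate(Z, eps=0.5):
    """Standard coding rate of the columns of Z (shape d x n).

    Assumed form: 0.5 * logdet(I + d / (n * eps^2) * Z Z^T).
    """
    d, n = Z.shape
    gram = np.eye(d) + (d / (n * eps ** 2)) * (Z @ Z.T)
    # slogdet is numerically safer than log(det(...)); the matrix is
    # positive definite, so the sign is +1 and can be ignored.
    _, logdet = np.linalg.slogdet(gram)
    return 0.5 * logdet


def low_rank_storage_ratio(n, c, r):
    """Dense n x c storage divided by rank-r factor storage (n x r plus r x c).

    Hypothetical helper: assumes a plain factorization with no sparse residual.
    """
    return (n * c) / (r * (n + c))


def unit_columns(Z):
    """Scale every column of Z to unit Euclidean norm."""
    return Z / np.linalg.norm(Z, axis=0, keepdims=True)


rng = np.random.default_rng(0)
d, n = 16, 200

# Columns spread over all d directions versus columns confined to a
# 2-dimensional subspace (rank 2).
full_rank = unit_columns(rng.standard_normal((d, n)))
rank_two = unit_columns(rng.standard_normal((d, 2)) @ rng.standard_normal((2, n)))

print("coding rate, full rank:", coding_rate(full_rank))
print("coding rate, rank 2:   ", coding_rate(rank_two))
print("storage ratio at n=c=1000, r=10:", low_rank_storage_ratio(1000, 1000, 10))
```

Under these assumptions, the rank-2 matrix has a markedly lower coding rate than the full-rank one, which is the sense in which low-rank soft labels are cheaper to encode. The storage helper only illustrates how the saving from a rank-r factorization scales with r.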
