PRISM: Diversifying Dataset Distillation by Decoupling Architectural Priors
Brian B. Moser, Shalini Sarode, Federico Raue, Stanislav Frolov, Krzysztof Adamkiewicz, Arundhati Shanbhag, Joachim Folz, Tobias C. Nauen, Andreas Dengel
TL;DR
PRISM tackles the lack of intra-class diversity in dataset distillation by decoupling architectural priors: a primary teacher supervises logit matching, while a diverse set of additional teachers supervises batch-normalization (BN) alignment. Drawing on several architectures during synthesis yields richer, more generalizable synthetic data on ImageNet-1K and state-of-the-art results at low- and mid-IPC (images per class) regimes. The work shows that diversifying architectural priors improves both accuracy and intra-class diversity, and it provides a scalable batch formation scheme, teacher-selection strategies, and ablations on the recovery and post-recovery steps. Collectively, PRISM establishes architectural decoupling as an orthogonal, scalable axis for advancing dataset distillation toward robust, privacy-preserving, large-scale applications.
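The decoupling described above can be illustrated with a toy objective. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the weight `alpha`, the subset size `k`, the squared-distance form of the BN term, and the averaging over sampled teachers are all hypothetical choices made for illustration. BN statistics are reduced to one scalar (mean, variance) pair per layer so the example stays dependency-free.

```python
import math
import random


def cross_entropy(logits, label):
    """Cross-entropy of one sample from raw logits (log-sum-exp for stability)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]


def bn_alignment(batch_stats, running_stats):
    """Squared distance between batch (mean, var) and stored BN (mean, var), summed over layers."""
    total = 0.0
    for (mu, var), (r_mu, r_var) in zip(batch_stats, running_stats):
        total += (mu - r_mu) ** 2 + (var - r_var) ** 2
    return total


def decoupled_loss(primary_logits, label, bn_teachers, k, alpha, rng):
    """Logit term from the primary teacher plus BN terms from k randomly drawn teachers.

    bn_teachers maps a (hypothetical) teacher name to a pair
    (batch_stats, running_stats): the statistics the synthetic batch induces
    in that teacher, and the statistics stored in its BN layers.
    """
    chosen = rng.sample(sorted(bn_teachers), k)  # stochastic subset of BN teachers
    logit_term = cross_entropy(primary_logits, label)
    bn_term = sum(bn_alignment(*bn_teachers[name]) for name in chosen) / k
    return logit_term + alpha * bn_term, chosen


# Toy usage: three BN teachers with a single BN layer each.
rng = random.Random(0)
bn_teachers = {
    "teacher_a": ([(0.1, 1.2)], [(0.0, 1.0)]),
    "teacher_b": ([(0.5, 0.9)], [(0.4, 1.0)]),
    "teacher_c": ([(0.0, 1.0)], [(0.0, 1.0)]),
}
loss, chosen = decoupled_loss([2.0, 0.5, -1.0], 0, bn_teachers, k=2, alpha=0.01, rng=rng)
```

The point of the sketch is only the separation of roles: the logit term never sees the BN teachers, and the BN term never sees the primary teacher's logits, so the two sources of supervision can come from different architectures.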
Abstract
Dataset distillation (DD) promises compact yet faithful synthetic data, but existing approaches often inherit the inductive bias of a single teacher model. As dataset size increases, this bias drives generation toward overly smooth, homogeneous samples, reducing intra-class diversity and limiting generalization. We present PRISM (PRIors from diverse Source Models), a framework that disentangles architectural priors during synthesis. PRISM decouples the logit-matching and regularization objectives, supervising them with different teacher architectures: a primary model for logits and a stochastic subset of teachers for batch-normalization (BN) alignment. On ImageNet-1K, PRISM consistently and reproducibly outperforms single-teacher methods (e.g., SRe2L) and recent multi-teacher variants (e.g., G-VBSM) at low- and mid-IPC regimes. The generated data also show significantly richer intra-class diversity, as reflected by a notable drop in cosine similarity between features. We further analyze teacher selection strategies (pre- vs. intra-distillation) and introduce a scalable cross-class batch formation scheme for fast parallel synthesis. Code will be released after the review period.
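The diversity measure mentioned above, cosine similarity between features, can be sketched as an average over same-class pairs. This is one plausible reading, not the paper's exact protocol: the function name `mean_intra_class_cosine`, the pooling over all same-class pairs, and the assumption of nonzero feature vectors are illustrative choices, and the feature extractor is left unspecified. Lower values indicate more diverse samples within a class.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two nonzero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_intra_class_cosine(features, labels):
    """Average pairwise cosine similarity over all pairs of samples sharing a label."""
    by_class = {}
    for f, y in zip(features, labels):
        by_class.setdefault(y, []).append(f)
    sims = [
        cosine(u, v)
        for group in by_class.values()
        for u, v in combinations(group, 2)
    ]
    if not sims:
        raise ValueError("need at least two samples in some class")
    return sum(sims) / len(sims)


# Toy usage: identical features within a class versus orthogonal ones.
homogeneous = mean_intra_class_cosine([[1.0, 0.0], [1.0, 0.0]], [0, 0])
diverse = mean_intra_class_cosine([[1.0, 0.0], [0.0, 1.0]], [0, 0])
```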
