Improving the Scaling Laws of Synthetic Data with Deliberate Practice
Reyhane Askari-Hemmat, Mohammad Pezeshki, Elvis Dohmatob, Florian Bordes, Pietro Astolfi, Melissa Hall, Jakob Verbeek, Michal Drozdzal, Adriana Romero-Soriano
TL;DR
The paper addresses the diminishing returns that arise when synthetic data for visual recognition is scaled simply by increasing dataset size. It introduces Deliberate Practice for Synthetic Data Generation (DP), a dynamic loop that couples a diffusion-based generator with a learner and uses the learner's prediction entropy to steer generation toward informative, hard examples. The authors give a random matrix theory analysis showing that training on such informative samples improves test-error scaling, and they validate DP empirically on ImageNet-100 and ImageNet-1k. On ImageNet-100, DP uses 3.4x fewer samples and 6x fewer iterations; on ImageNet-1k, it uses 8x fewer samples and 30% fewer iterations, while outperforming prior work and improving out-of-distribution (OOD) results. The results suggest that adaptive, entropy-guided synthetic data generation can substantially reduce data and compute requirements while improving generalization, pointing toward scalable, data-efficient synthetic-data pipelines.
Abstract
Inspired by the principle of deliberate practice in human learning, we propose Deliberate Practice for Synthetic Data Generation (DP), a novel framework that improves sample efficiency through dynamic synthetic data generation. Prior work has shown that scaling synthetic data is inherently challenging, as naively adding new data leads to diminishing returns. To address this, pruning has been identified as a key mechanism for improving scaling, enabling models to focus on the most informative synthetic samples. Rather than generating a large dataset and pruning it afterward, DP efficiently approximates the direct generation of informative samples. We theoretically show how training on challenging, informative examples improves scaling laws and empirically validate that DP achieves better scaling performance with significantly fewer training samples and iterations. On ImageNet-100, DP generates 3.4x fewer samples and requires 6x fewer iterations, while on ImageNet-1k, it generates 8x fewer samples with a 30% reduction in iterations, all while achieving superior performance compared to prior work.
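The TL;DR above describes guiding generation by the learner's prediction entropy. As a rough illustration of that signal only, the sketch below computes the Shannon entropy of a classifier's predicted class distribution and ranks candidate samples by it. It is a minimal sketch, not the authors' method: DP steers a diffusion generator toward such samples directly, whereas this example scores an already generated pool, which is the generate-then-prune baseline that DP is said to approximate. The function names `prediction_entropy` and `select_most_uncertain` and the probability values are hypothetical.

```python
import math


def prediction_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    # Zero-probability classes contribute nothing and are skipped to avoid log(0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_most_uncertain(candidates, k):
    """Return the indices of the k candidates with the highest entropy.

    `candidates` is a list of class-probability vectors, one per sample.
    """
    scores = [prediction_entropy(p) for p in candidates]
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical learner outputs for four synthetic images over three classes.
candidates = [
    [0.98, 0.01, 0.01],  # confident prediction: low entropy
    [0.34, 0.33, 0.33],  # near-uniform prediction: high entropy
    [0.70, 0.20, 0.10],
    [0.50, 0.50, 0.00],
]
hard = select_most_uncertain(candidates, k=2)
print(hard)
```

Under these assumed probabilities, the near-uniform prediction ranks first, since entropy is maximal when the learner spreads its probability mass evenly across classes.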
