Improving the Scaling Laws of Synthetic Data with Deliberate Practice

Reyhane Askari-Hemmat, Mohammad Pezeshki, Elvis Dohmatob, Florian Bordes, Pietro Astolfi, Melissa Hall, Jakob Verbeek, Michal Drozdzal, Adriana Romero-Soriano

TL;DR

The paper tackles the inefficiency of scaling synthetic data for visual recognition, where simply increasing dataset size yields diminishing returns. It introduces Deliberate Practice for Synthetic Data Generation (DP), a dynamic loop that couples a diffusion-based generator with a learning model and uses the learner's prediction entropy to steer generation toward informative, hard examples. The authors give a random matrix theory analysis showing improved test-error scaling when training on such samples, and validate DP empirically: on ImageNet-100 it uses 3.4x fewer samples and 6x fewer iterations, and on ImageNet-1k it uses 8x fewer samples with 30% fewer iterations, while outperforming prior work and improving OOD results. The results indicate that adaptive, entropy-guided synthetic data generation can reduce data and compute requirements while improving generalization, suggesting a viable path toward scalable, data-efficient synthetic-data pipelines.

Abstract

Inspired by the principle of deliberate practice in human learning, we propose Deliberate Practice for Synthetic Data Generation (DP), a novel framework that improves sample efficiency through dynamic synthetic data generation. Prior work has shown that scaling synthetic data is inherently challenging, as naively adding new data leads to diminishing returns. To address this, pruning has been identified as a key mechanism for improving scaling, enabling models to focus on the most informative synthetic samples. Rather than generating a large dataset and pruning it afterward, DP efficiently approximates the direct generation of informative samples. We theoretically show how training on challenging, informative examples improves scaling laws and empirically validate that DP achieves better scaling performance with significantly fewer training samples and iterations. On ImageNet-100, DP generates 3.4x fewer samples and requires six times fewer iterations, while on ImageNet-1k, it generates 8x fewer samples with a 30 percent reduction in iterations, all while achieving superior performance compared to prior work.
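To make the notion of an "informative" sample concrete, here is a minimal sketch of the generate-then-prune baseline that DP aims to approximate, assuming (as stated in the TL;DR) that informativeness is measured by the learner's prediction entropy. The function names `predictive_entropy` and `select_most_uncertain` and the toy probabilities are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def predictive_entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of each row of a (n_samples, n_classes) array."""
    probs = np.asarray(probs, dtype=float)
    # eps guards against log(0) for classes with zero predicted probability.
    return -np.sum(probs * np.log(probs + eps), axis=1)

def select_most_uncertain(probs, k):
    """Indices of the k highest-entropy rows, most uncertain first."""
    scores = predictive_entropy(probs)
    order = np.argsort(-scores, kind="stable")
    return order[:k]

# Toy softmax outputs of a hypothetical learner on four generated images.
probs = np.array([
    [0.98, 0.01, 0.01],  # confident prediction: low entropy
    [0.34, 0.33, 0.33],  # near-uniform prediction: high entropy
    [0.70, 0.20, 0.10],
    [0.50, 0.50, 0.00],
])

keep = select_most_uncertain(probs, k=2)
print(keep)  # indices of the two samples the learner is least sure about
```

In this baseline every candidate must be generated before it can be scored, which is the cost DP is designed to avoid by steering the generator toward high-entropy samples directly.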

Paper Structure

This paper contains 49 sections, 11 theorems, 95 equations, 13 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

In the limit given by Eq. (eq:proportionate), the classification test error satisfies $E_{\text{test}}(\hat{w}) \to \Phi\left(-m_0/\sqrt{\nu_0 - m_0^2}\right)$, where

Figures (13)

  • Figure 1: (Top): Conventional approaches generate (or collect) a massive static dataset and then select challenging examples in a one-time filtering step based on the learner’s selection criterion. This is inefficient, as most generated data is discarded. (Bottom): DP continuously generates only the most challenging examples based on continuous feedback from the learner, eliminating the need for large-scale data pruning. This iterative process ensures that training focuses on progressively informative examples, improving efficiency and performance. (Right): Top-1 validation accuracy on ImageNet-1k with models trained solely on synthetic data. DP (orange) achieves higher accuracy than the 13M synthetic data setup (blue) while using 10× fewer samples, significantly outperforming the 1.3M baseline (gray).
  • Figure 2: Training loss (left) and validation accuracy (right) of Deliberate Practice on ImageNet-100. The classifier begins training on an initial static dataset (130k samples) until validation accuracy plateaus. At this point, additional samples are generated using entropy-guided sampling, focusing on hard/informative examples. The two dashed vertical lines indicate points where new data is added. We compare three setups: (1) Orange: No additional data is added, training only on the initial dataset. (2) Purple: One round of entropy-guided data generation adds 130k samples. (3) Blue: Two rounds of entropy-guided data generation, adding 260k samples in total. Each data addition leads to an accuracy boost, demonstrating the effectiveness of DP in improving performance with fewer training iterations. For clarity, this figure shows only two rounds of data addition, but in practice, more rounds occur based on the allowed maximum patience. Notably, while training loss increases with new data, validation accuracy steadily improves, showing that the model benefits from progressively challenging examples, ultimately reducing the generalization gap.
  • Figure 3: Theoretical prediction for scaling behavior of accuracy (Theorem 1) for a simple classifier in a $d=512$ dimensional input space, as a function of dataset selection strategy. The classifier is trained on synthetic data with different pruning probabilities, where higher pruning probability corresponds to keeping only the most challenging examples (those closer to the decision boundary). The results compare selecting all samples (gray) versus selecting a fraction of the hardest samples (red). Selecting harder examples improves sample efficiency, achieving higher accuracy with fewer training samples.
  • Figure 4: Scaling laws of synthetic data. Real validation accuracy versus total dataset size for the Static (pink $\times$) and Deliberate Practice (blue $\circ$) setups on ImageNet-100 (left) and ImageNet-1k (right). DP significantly outperforms Static data generation, achieving higher accuracy with fewer synthetic examples. DP matches the accuracy of the static setup with 7.5$\times$ less data on ImageNet-100, and outperforms it with 20$\times$ less data on ImageNet-1k.
  • Figure 5: Performance of DP compared to explicit pruning and to the theoretical prediction as the over-sampling ratio or the DP coefficient varies. (a) Over-sampling with entropy-based selection: generate a large pool of samples (ranging from 130k to 2.4M) and select the 130k highest-entropy examples. (b) Generate 130k high-entropy examples directly using DP, with the entropy guidance strength varied through $\omega$. (c) The theoretical prediction of accuracy as a function of the over-sampling ratio. (d) Compute cost of DP versus over-sampling followed by pruning. DP exhibits an accuracy curve similar to both explicit pruning and the theoretical prediction as the over-sampling ratio or DP coefficient changes, while being considerably cheaper to compute and yielding a larger accuracy gain.
  • ...and 8 more figures
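The limiting test error stated in the Key Result, $\Phi\left(-m_0/\sqrt{\nu_0 - m_0^2}\right)$, underlies the accuracy curves that the Figure 3 caption attributes to Theorem 1. The sketch below only evaluates that closed form, where $\Phi$ is the standard normal CDF. The definitions of $m_0$ and $\nu_0$ are not reproduced on this page, so they are treated as plain inputs, and the numeric values used are illustrative assumptions rather than values from the paper.

```python
import math

def normal_cdf(x):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def limiting_test_error(m0, nu0):
    """Evaluate Phi(-m0 / sqrt(nu0 - m0**2)) for given scalars m0 and nu0."""
    gap = nu0 - m0 ** 2
    if gap <= 0:
        raise ValueError("nu0 must be strictly greater than m0**2")
    return normal_cdf(-m0 / math.sqrt(gap))

# Illustrative inputs only (assumed, not taken from the paper).
for m0, nu0 in [(0.0, 1.0), (1.0, 2.0), (2.0, 5.0)]:
    err = limiting_test_error(m0, nu0)
    print(f"m0={m0}, nu0={nu0}: error={err:.4f}, accuracy={1.0 - err:.4f}")
```

With $m_0 = 0$ the expression reduces to $\Phi(0) = 0.5$, which is chance level for a binary problem, and the error shrinks as $m_0$ grows relative to $\sqrt{\nu_0 - m_0^2}$.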

Theorems & Definitions (15)

  • Theorem 1
  • Corollary 1
  • Lemma 1
  • Lemma 2
  • Definition 1: Deterministic Equivalents
  • Proposition 1
  • Lemma 3
  • proof
  • Corollary 2
  • Remark 1
  • ...and 5 more