Spend Wisely: Maximizing Post-Training Gains in Iterative Synthetic Data Bootstrapping

Pu Yang, Yunzhen Feng, Ziyuan Chen, Yuhang Wu, Zhuoyuan Li

TL;DR

This work studies how to allocate a fixed computational budget across iterations of synthetic-data bootstrapping in post-training. It develops a theoretical framework showing that constant policies fail to converge with high probability, while increasing policies, particularly exponential growth, guarantee exponential convergence and often beat polynomial schemes in worst-case cost. The authors validate these claims with two experiments, image denoising with diffusion probabilistic models and math reasoning with large language models, in which exponential growth policies consistently outperform constant policies and generally surpass linear growth in stability and final performance. The results offer principled guidance for resource-efficient post-training wherever synthetic data generation is central, with potential extensions to RLHF and broader iterative training settings.

Abstract

Modern foundation models often undergo iterative "bootstrapping" in their post-training phase: a model generates synthetic data, an external verifier filters out low-quality samples, and the high-quality subset is used for further fine-tuning. Over multiple iterations, the model performance improves, raising a crucial question: How should the total budget for generation and training be allocated across iterations to maximize final performance? In this work, we develop a theoretical framework for analyzing budget allocation strategies. Specifically, we show that constant policies fail to converge with high probability, while increasing policies, particularly exponential growth policies, exhibit significant theoretical advantages. Experiments on image denoising with diffusion probabilistic models and math reasoning with large language models show that both exponential and polynomial growth policies consistently outperform constant policies, with exponential policies often providing more stable performance.
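
The loop described in the abstract can be sketched in a few lines. This is a minimal toy illustration, not the authors' setup: the generator is assumed to be a one-dimensional Gaussian with mean `theta`, the verifier is assumed to keep only samples above the current mean, and fine-tuning is assumed to be a refit of the mean on the kept samples. The names `bootstrap` and `policy` are hypothetical; `policy(t)` returns the number of samples retained at iteration `t`, which is the quantity a budget policy controls.

```python
import random
from typing import Callable, List


def bootstrap(policy: Callable[[int], int], iterations: int, seed: int = 0) -> List[float]:
    """Toy generate -> filter -> refit loop (an illustrative assumption, not the paper's model)."""
    rng = random.Random(seed)
    theta = 0.0                      # hypothetical generator parameter: mean of a unit Gaussian
    history = [theta]
    for t in range(iterations):
        n_t = max(1, policy(t))      # samples to retain at iteration t (the budget decision)
        kept: List[float] = []
        while len(kept) < n_t:
            sample = rng.gauss(theta, 1.0)   # generation step
            if sample > theta:               # hypothetical verifier: keep above-average samples
                kept.append(sample)
        theta = sum(kept) / len(kept)        # stand-in for fine-tuning: refit the mean
        history.append(theta)
    return history
```

Passing `lambda t: 8` gives a constant policy, while `lambda t: 8 * 2 ** t` gives an exponential one; in both cases the total number of retained samples is the sum of `policy(t)` over the iterations.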

Paper Structure

This paper contains 51 sections, 11 theorems, 97 equations, 8 figures, 3 tables, 3 algorithms.

Key Result

Theorem 3.1

If the initial parameter satisfies $\theta^{(0)} \leq (1 + \sigma^2/\kappa^2)^T (\sigma^2 + \kappa^2)^{1/2}$, then the optimal iterative policy, i.e., the solution of the optimization problem in eq:opt-intuition, is given by

Figures (8)

  • Figure 1: Iterative learning with synthetic data. In this framework, synthetic data is generated, filtered using a reward model, and the selected data is used to further train the generator. The budget policy is defined as the quantity of data retained after selection, $n_t$. Our goal is to identify the optimal policy across iterations to achieve the best final performance, given a fixed budget.
  • Figure 2: Empirical results of the toy example. We compare the exponential, constant, and linear policies, and show the gap to the optimal expected reward as a function of the computational cost ($\sum_{t}n_t$). All results are averaged over 1,000 runs.
  • Figure 3: Empirical results of image denoising. The figure shows the average PSNR of the generated denoised images as a function of computational cost. Computational cost is measured in floating-point operations (FLOPs) during iterative learning, including both generation and training. The training batch size is set to $B=640$. We adopt $n_t = 1.1^t \cdot B$ and $n_t = 1.05^t \cdot B$ as the exponential growth policies for $s=10$ and $s=20$, respectively.
  • Figure 4: Empirical results of math reasoning. The figure shows accuracy as a function of computational cost, again measured in FLOPs. The training batch size is set to $B=256$. We adopt $n_t = 10 \cdot 2^t \cdot B$ as the exponential policy. We set the temperature to 0.3 during the generation step and include the results for a temperature of 0.7 in the appendix, which lead to the same conclusions.
  • Figure 5: Additional experimental results of the toy example with diverse parameters.
  • ...and 3 more figures
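
The schedules quoted in the captions of Figures 3 and 4 can be written out as small functions. The sketch below is only illustrative: the helper names (`constant`, `linear`, `exponential`, `cumulative_samples`) are hypothetical, the iteration index is assumed to start at $t=0$, the `linear` parameters are placeholders because the captions do not give them, and cost is counted in retained samples ($\sum_t n_t$, as in Figure 2) rather than in FLOPs.

```python
from typing import Callable

Policy = Callable[[int], int]


def constant(n0: int) -> Policy:
    """n_t = n0 for every iteration."""
    return lambda t: n0


def linear(n0: int, step: int) -> Policy:
    """n_t = n0 + step * t (step is a placeholder; the captions do not specify it)."""
    return lambda t: n0 + step * t


def exponential(n0: int, rate: float) -> Policy:
    """n_t = n0 * rate**t, rounded to a whole number of samples."""
    return lambda t: round(n0 * rate ** t)


def cumulative_samples(policy: Policy, iterations: int) -> int:
    """Total retained samples over iterations t = 0, ..., iterations - 1."""
    return sum(policy(t) for t in range(iterations))


# Schedules taken from the figure captions (indexing from t = 0 is an assumption).
denoise_s10 = exponential(640, 1.1)       # n_t = 1.1^t * B with B = 640
denoise_s20 = exponential(640, 1.05)      # n_t = 1.05^t * B with B = 640
math_policy = exponential(10 * 256, 2.0)  # n_t = 10 * 2^t * B with B = 256
```

For example, `cumulative_samples(math_policy, 4)` sums $2560 \cdot (1 + 2 + 4 + 8)$ retained samples, which shows how quickly an exponential schedule concentrates the budget in the last iterations.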

Theorems & Definitions (16)

  • Theorem 3.1
  • Theorem 4.1: Bounded Reward for Constant policy
  • Theorem 4.2: Optimal Reward for Increasing Policies
  • Theorem 4.3: Convergence Rate for the Exponential Policy
  • Theorem 4.4: Worst-Case Optimality of the Exponential Policy, informal
  • Lemma A.1
  • proof
  • Remark
  • Remark
  • Remark
  • ...and 6 more