Synthetic Data for any Differentiable Target

Tristan Thrush, Sung Min Park, Herman Brunborg, Luke Bailey, Marcel Roed, Neil Band, Christopher Potts, Tatsunori Hashimoto

Abstract

What are the limits of controlling language models via synthetic training data? We develop a reinforcement learning (RL) primitive, the Dataset Policy Gradient (DPG), which can precisely optimize synthetic data generators to produce a dataset of targeted examples. When used for supervised fine-tuning (SFT) of a target model, these examples cause the target model to do well on a differentiable metric of our choice. Our approach achieves this by computing exact data-attribution scores via higher-order gradients and using those scores as policy gradient rewards. We prove that this procedure closely approximates the true, intractable gradient for the synthetic data generator. To illustrate the potential of DPG, we show that, using only SFT on generated examples, we can cause the target model's LM head weights to (1) embed a QR code, (2) embed the pattern $\texttt{67}$, and (3) have lower $\ell^2$ norm. We additionally show that we can cause the generator to (4) rephrase inputs in a new language and (5) produce a specific UUID, even though neither of these objectives is conveyed in the generator's input prompts. These findings suggest that DPG is a powerful and flexible technique for shaping model properties using only synthetic training examples.
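
As a point of reference, here is a minimal sketch of the loop the abstract describes, written in PyTorch. The APIs `generator.sample`, `target.per_example_sft_loss`, and the metric `phi` are hypothetical placeholders, and the single inner SFT step stands in for the paper's full procedure (which uses GRPO and multi-step target training); the sketch illustrates the idea of using higher-order gradients as policy-gradient rewards, not the authors' implementation.

```python
import torch

def dpg_step(generator, target, prompts, phi, gen_opt, inner_lr=1e-4):
    """One hypothetical Dataset Policy Gradient step (illustration only)."""
    # 1) Sample a synthetic batch from the generator, keeping log-probs
    #    that are differentiable w.r.t. the generator's parameters.
    examples, logprobs = generator.sample(prompts)            # assumed API

    # 2) Attach a per-example weight to each sample; differentiating the
    #    final metric w.r.t. these weights yields exact data attribution.
    w = torch.ones(len(examples), requires_grad=True)
    losses = target.per_example_sft_loss(examples)            # assumed API
    inner_loss = (w * losses).mean()

    # 3) One differentiable SGD step on the target model; create_graph=True
    #    retains the higher-order graph needed for metagradients.
    params = list(target.parameters())
    grads = torch.autograd.grad(inner_loss, params, create_graph=True)
    new_params = [p - inner_lr * g for p, g in zip(params, grads)]

    # 4) Evaluate the differentiable target metric (assumed here to be a
    #    loss we minimize) on the updated weights, then differentiate back
    #    to the per-example weights: each score says how much that example
    #    improved the metric, and serves as its reward.
    rewards = -torch.autograd.grad(phi(new_params), w)[0]

    # 5) REINFORCE-style policy-gradient update of the generator (the
    #    paper itself uses GRPO).
    pg_loss = -(rewards.detach() * logprobs).mean()
    gen_opt.zero_grad()
    pg_loss.backward()
    gen_opt.step()
```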

Paper Structure

This paper contains 29 sections, 5 theorems, 45 equations, 7 figures, 13 tables, and 1 algorithm.

Key Result

Theorem 3.1

Suppose we train the target model in $\mathcal{A}$ for $T$ steps of minibatch stochastic gradient descent (SGD) with batch size $B$ and a learning rate of $\eta$. Under suitable regularity conditions on smoothness (proofs appendix, Assumptions A1-A8), we have the following. N.B.: although it may be clear to some, the notation can be tricky to keep straight. In this equation, we take the gradient of $F'$ with respect
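
The theorem's setup ($T$ SGD steps, batch size $B$, learning rate $\eta$) can be made concrete with a sketch of the fully unrolled metagradient computation that the true generator gradient involves. This is our own illustrative reconstruction, assuming a `per_example_loss` helper and functional parameter updates; the autograd graph retained across all $T$ steps is what makes the exact quantity expensive, and the abstract's claim is that DPG's estimator closely approximates it.

```python
import torch

def unrolled_metagradients(params, batches, phi, eta):
    """Hypothetical unrolled computation underlying Theorem 3.1's setup:
    T steps of minibatch SGD (one batch of size B per step) at learning
    rate eta, then differentiating the final metric through the whole run.

    params:  list of parameter tensors with requires_grad=True
    batches: list of T minibatches of training examples
    phi:     differentiable metric of the final parameters
    """
    # One weight per example; gradients w.r.t. these weights are exact
    # per-example attribution scores across the entire trajectory.
    weights = [torch.ones(len(b), requires_grad=True) for b in batches]
    for w, batch in zip(weights, batches):
        losses = per_example_loss(params, batch)    # assumed helper
        step_loss = (w * losses).mean()
        # create_graph=True keeps the graph so phi can be differentiated
        # back through every SGD step.
        grads = torch.autograd.grad(step_loss, params, create_graph=True)
        params = [p - eta * g for p, g in zip(params, grads)]
    return torch.autograd.grad(phi(params), weights)
```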

Figures (7)

  • Figure 1: Dataset Policy Gradients allow us to generate synthetic training data for any differentiable target. For example, our generator can learn to generate special Wikipedia article rephrases. When used for continued pretraining of GPT-2, these rephrases turn the upper left 21x21 patch of GPT-2's LM head weight matrix into the QR code seen here (when subtracted from the initial weights, sign'd, and visualized as a greyscale image). The text sample in this figure is the first item in the synthetic dataset, which we generated with a temperature of 1 (i.e., noisy data still produces the result).
  • Figure 2: Here, we initialize the target model in $\mathcal{A}$ to be GPT-2, and explore exotic target metrics (a sketch of one such metric appears after this figure list): the goal of the first metric is to encode the greyscale image $\texttt{67}$ in the upper 6x7 patch of the sign'd LM head weight updates to the target model. This number was chosen arbitrarily. The goal of the second metric is to lower the $\ell^2$ norm of the target model's LM head. The plots show validation performance as the GRPO process trains the generator. All validations are done with 96 steps of continued training on GPT-2. The (96), (8), and (1) notation denotes whether the generator was trained via metagradients with respect to an $\mathcal{A}$ that used 96, 8, or 1 step(s). We observe a weak correlation between $\mathcal{A}$ steps and validation performance, and generally more validation stability with more $\mathcal{A}$ steps.
  • Figure 3: Final validation results for the 6x7 pixel images in the target models' sign'd LM head updates, after the generator was fully trained. The numbers above the images denote the number of target model training steps in $\mathcal{A}$ for metagradient computation. All validations were done with 96 target model training steps, using the corresponding optimizer; the difference is whether the generator was trained using a reward function with fewer $\mathcal{A}$ training steps. Only Adam with 96 steps in $\mathcal{A}$ for metagradients produced a generator that achieved a perfect result (the initial 96-step run came close, so we trained the generator again with a different random sample of Wikipedia prompts and then obtained a perfect score).
  • Figure 4: Generator results when setting $\Phi$ to be post-training loss on four multilingual LAMBADA [lambada] translations from [multilingual_lambada]: DE, ES, FR, and IT. We initialized the generator from Llama 3.2 Instruct, and also initialized the target model in $\mathcal{A}$ from Llama 3.2 Instruct. In each GRPO step, we conduct a single step of target model continued pretraining on the synthetic data before computing metagradients. When using Adam in $\mathcal{A}$, the generator learns the correct language, as judged by GPT-4.1 Nano [openai_gpt41_nano]. Baselines do not learn the correct language except in rare cases where their entropy quickly collapses and they repeatedly produce only a few words.
  • Figure 5: We keep the same setup as the LAMBADA cases, with the exception of changing $\Phi$ to be the target model's post-training LM loss on a 32-character UUID. In this plot, we show two validation metrics: Exact requires the complete UUID to be in a rollout, and Soft finds the longest substring of the UUID in the rollout and gives points proportional to the fraction of the UUID present.
  • ...and 2 more figures
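
To make the exotic metrics above concrete, the following is a minimal sketch of what a differentiable $\Phi$ for the image targets of Figures 1-3 could look like. The tanh relaxation and the temperature `tau` are our assumptions; the paper's exact surrogate for the non-differentiable sign() is not given in this summary.

```python
import torch

def image_patch_metric(new_lm_head, init_lm_head, target_image, tau=1e-3):
    """Hypothetical differentiable surrogate for the image metrics in
    Figures 1-3. target_image is a {0, 1} tensor (21x21 for the QR code,
    6x7 for the "67" pattern)."""
    h, w = target_image.shape
    delta = new_lm_head[:h, :w] - init_lm_head[:h, :w]   # upper-left patch
    # The figures visualize sign(delta), but sign() has zero gradient
    # almost everywhere, so we assume a tanh relaxation with temperature tau.
    soft_sign = torch.tanh(delta / tau)
    target = 2.0 * target_image - 1.0                    # {0,1} -> {-1,+1}
    # Mean agreement with the target pattern, negated so that minimizing
    # this value drives the sign'd weight update toward the image.
    return -(soft_sign * target).mean()
```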

Theorems & Definitions

  • Theorem 3.1 (with proof)
  • Lemma A.9 (with proof)
  • Lemma A.10 (with proof)
  • Lemma A.11 (with proof)
  • Theorem A.11 (with proof)