Towards Active Synthetic Data Generation for Finetuning Language Models

Samuel Kessler, Menglin Xia, Daniel Madrigal Diaz, Dongge Han, Helia Heshemi, Saravan Rajmohan, Victor Ruehle, Jordan T. Ash

TL;DR

This work tackles the data-efficiency challenge in finetuning language models by proposing an iterative, teacher-guided synthetic data generation loop. By conditioning data generation on the evolving state of the student and applying simple active-learning selection methods, the approach achieves stronger performance with fewer synthetic examples than static data generation. Across four mathematical and logical reasoning benchmarks and multiple small models, high-loss (uncertainty) based data selection consistently yields the best data efficiency, while expensive LLM-based judges offer diminishing returns. The results demonstrate that synthetic data retain key properties of the seed data and that this steerable curriculum significantly improves SFT effectiveness in practical settings.

Abstract

A common and effective means for improving language model capabilities involves finetuning a "student" language model's parameters on generations from a more proficient "teacher" model. Termed "synthetic data", these generations are often produced before any student finetuning, but some work has considered generating new synthetic samples as training progresses. This paper studies and advocates for the latter case, where data are generated in an iterative, closed-loop fashion that is guided by the current state of the student model. For a fixed budget of generated samples, or a budget in terms of compute spent querying a teacher, we show that this curation of finetuning data affords improved student performance over static generation. Further, while there have been several LLM-specific methods proposed that operate in this regime, we find that simple, inexpensive selection criteria from the active learning literature tend to be most performant. We validate these claims across four mathematical and logical reasoning datasets using four different small language models.
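The closed loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's algorithm: the function names (`select_high_loss`, `iterative_synthetic_sft`) and the callables `student_loss`, `teacher_generate`, and `finetune` are hypothetical placeholders, and the sketch assumes top-k selection by student loss, one teacher generation per selected seed, and a seed pool that is reused across rounds.

```python
from typing import Any, Callable, List

Example = Any  # hypothetical: whatever structure a seed or synthetic example has


def select_high_loss(pool: List[Example],
                     loss_fn: Callable[[Example], float],
                     k: int) -> List[Example]:
    """Return the k seed examples with the highest current student loss."""
    return sorted(pool, key=loss_fn, reverse=True)[:k]


def iterative_synthetic_sft(seed_pool: List[Example],
                            student_loss: Callable[[Example], float],
                            teacher_generate: Callable[[Example], Example],
                            finetune: Callable[[List[Example]], None],
                            rounds: int,
                            k: int) -> List[Example]:
    """Sketch of student-guided synthetic data generation.

    Each round: score seeds with the current student, prompt the teacher
    with the selected seeds, grow the training set, and finetune on it.
    """
    train_set: List[Example] = []
    for _ in range(rounds):
        # Assumption: selection is a simple top-k on student loss.
        chosen = select_high_loss(seed_pool, student_loss, k)
        # Assumption: the teacher produces one synthetic example per seed.
        train_set.extend(teacher_generate(seed) for seed in chosen)
        # The student is updated on the growing synthetic training set,
        # so student_loss reflects the new student in the next round.
        finetune(train_set)
    return train_set
```

In practice `student_loss` would wrap a forward pass of the student model and `teacher_generate` a call to the teacher; both are left abstract here because the section does not specify them.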

Paper Structure

This paper contains 54 sections, 3 equations, 14 figures, 3 tables, 1 algorithm.

Figures (14)

  • Figure 1: Overview of iterative synthetic data generation (\ref{alg:iter_syn_data_gen}). The student model guides synthetic data generation by prioritizing which data are used as examples for the teacher model to generate new synthetic data points (\ref{sec:prompt_syn_gen}). The student is then finetuned on the teacher-generated synthetic data.
  • Figure 2: SFT performance on $1$k data points for various datasets and SLMs. We compare the effect of synthetic answer generation and synthetic question and answer generation to using the seed dataset, $D_0$, for SFT. $0$-shot SLM and teacher performances are included for reference. All datasets use a GPT-4o teacher, except Game of 24, for which we use a GPT-o3-mini teacher.
  • Figure 3: Student performance over successive synthetic data iterations with growing training sets (\ref{alg:iter_syn_data_gen}). Each inset plot shows the proportion of data that random sampling requires to match the performance of the best active scorers for synthetic data generation.
  • Figure 4: Pairwise winrate over all datasets and methods. ${\bm{P}}_{ij}$ corresponds to the number of times algorithm $i$ outperforms $j$. Overall performance is shown in the last row (lower is better).
  • Figure 5: Iterative synthetic data generation learning curves on GSM8k: student performance versus the number of teacher input and output tokens. The total number of input and output tokens is a proxy for the amount of compute used by the teacher for each selection method.
  • ...and 9 more figures
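
The pairwise win count ${\bm{P}}_{ij}$ defined in the Figure 4 caption can be illustrated with a short sketch. The function name `pairwise_wins` and the input layout (one dictionary of method scores per dataset or setting, higher is better) are assumptions made for illustration; the caption does not say how ties are handled or how the "overall" row is aggregated, so ties count for neither method here and no aggregate is computed.

```python
from typing import Dict, List, Tuple


def pairwise_wins(results: List[Dict[str, float]]) -> Tuple[List[str], List[List[int]]]:
    """Count how often each method strictly outperforms each other method.

    results: one dict per dataset/setting, mapping method name -> score
             (assumed: higher is better).
    Returns (methods, P), where P[i][j] is the number of settings in which
    methods[i] scored strictly higher than methods[j].
    """
    methods = sorted({name for scores in results for name in scores})
    index = {name: i for i, name in enumerate(methods)}
    P = [[0] * len(methods) for _ in methods]
    for scores in results:
        for a, score_a in scores.items():
            for b, score_b in scores.items():
                # Ties and self-comparisons contribute nothing (an assumption).
                if a != b and score_a > score_b:
                    P[index[a]][index[b]] += 1
    return methods, P
```

Calling `pairwise_wins` on a list of per-dataset score dictionaries returns the sorted method names together with the matrix, which can then be rendered as a heatmap.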