SoftSRV: Learn to Generate Targeted Synthetic Data

Giulia DeSalvo, Jean-François Kagy, Lazaros Karydas, Afshin Rostamizadeh, Sanjiv Kumar

TL;DR

SoftSRV tackles the challenge of generating targeted synthetic fine-tuning data without labor-intensive prompt engineering. It learns small parametric embeddings that condition a frozen LLM to produce data mirroring a target distribution, using an autoencoder-like reconstruction objective. The framework introduces three parameterizations: Non-contextual (SS_NP), Mixture (SS_MPk), and MLP-Concatenated (SS_MC). Contextual embeddings prove crucial for diversity and fidelity, and SS_MC delivers the strongest downstream gains on coding, math, and reasoning benchmarks. Empirically, SoftSRV outperforms natural-language prompt templates and shows strong distribution alignment as measured by MAUVE, with additional benefits in out-of-domain transfer and data scaling. This approach reduces human effort, generalizes across domains, and suggests future work on adaptive context selection and broader deployment in real-world fine-tuning pipelines.
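To make the three parameterizations concrete, the following is a minimal NumPy sketch of what each could look like as a function from a context vector z to a (t, d) soft-prompt matrix. The summary above only names the variants, so everything else here is an assumption made for illustration: the sizes, the softmax mixture weights, the one-hidden-layer MLPs, and all identifiers (`ss_np`, `ss_mp`, `ss_mc`, `params`).

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes: context dim, LLM embedding dim, prompt length, mixture size, MLP hidden dim.
d_z, d, t, k, h = 8, 16, 4, 3, 32

params = {
    "P": rng.normal(size=(t, d)),            # SS_NP: a single learned prompt
    "P_basis": rng.normal(size=(k, t, d)),   # SS_MPk: k learned basis prompts
    "W": rng.normal(size=(k, d_z)),          # SS_MPk: maps z to mixture logits (assumed form)
    "mlps": [(rng.normal(size=(h, d_z)), rng.normal(size=(d, h))) for _ in range(t)],
}

def ss_np(params, z):
    """Non-contextual: returns the same (t, d) prompt regardless of z."""
    return params["P"]

def ss_mp(params, z):
    """Mixture: softmax weights computed from z combine k basis prompts."""
    logits = params["W"] @ z
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return np.tensordot(w, params["P_basis"], axes=1)   # (k,) x (k, t, d) -> (t, d)

def ss_mc(params, z):
    """MLP-concatenated: t small MLPs each map z to one d-dim prompt row."""
    rows = [W2 @ np.tanh(W1 @ z) for W1, W2 in params["mlps"]]
    return np.stack(rows)                                # (t, d)
```

In this sketch only `ss_mp` and `ss_mc` depend on z, which is the sense in which they are contextual: different example sequences yield different soft prompts, whereas `ss_np` yields a single prompt for the whole dataset.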

Abstract

We present a novel framework, SoftSRV, that is used to generate targeted synthetic fine-tuning data for improving task-specific model performance. Given a sample from a target distribution, our proposed framework uses a data-driven loss minimization approach to steer a frozen large language model (LLM) to generate synthetic sequences that are similar to those from the target distribution. SoftSRV provides a practical improvement over common prompt engineering approaches that rely on human-engineered prompt-templates, which can be idiosyncratic, labor-intensive to craft, and may need to be specialized per domain. We empirically evaluate our method against standard baselines guiding a large LLM to generate synthetic data to fine-tune a smaller language model on three different domains (coding, math, reasoning). We perform these evaluations without any particular specialization of the framework to each domain, emphasizing the generality of our approach. We find that SoftSRV improves upon typical prompt engineering approaches, generating targeted data that leads to fine-tuned models with significantly better task-specific performance. In addition, SoftSRV-generated data better matches the target distribution according to the MAUVE similarity metric.

Paper Structure

This paper contains 30 sections, 3 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: A diagram illustrating the training workflow of the SoftSRV framework. An example sequence $x$ is embedded into a dense vector $\mathbf{z}$ via a (frozen) sequence encoder model. The SoftSRV model, parameterized by $\theta$ and conditioned on the embedding $\mathbf{z}$, produces a SoftSRV embedding $\mathbf{P}_\theta(\mathbf{z})$. This is then fed to a (frozen) pre-trained LLM, which produces a synthetic example $x'$. Similar to autoencoder-based training, the gradient of a next-word-prediction "reconstruction" loss is computed and used to update the SoftSRV parameters.
  • Figure 2: An illustration of the workflow for generating synthetic SFT data with SoftSRV (top panel) and with natural language prompt templates (bottom panel); step numbering matches the discussion in Section \ref{pipeline}. For SoftSRV, we use questions from the train set to (1) train a contextual embedding; (2) embed the same training sequences to serve as context vectors for the trained embedding, whose output is then fed to the LLM to generate synthetic questions; and (3) generate answers by simply feeding the synthetic questions to an LLM. The baseline prompt template framework has no training step (1), although offline human effort is needed to write the various prompt templates for each data domain; step (2) generates questions by filling the natural language template with questions from the train set (optionally followed by rounds of refinement prompting); and step (3) generates answers to these questions, again using a template, this time populated with the synthetic questions.
  • Figure 3: Gemma 2 (2B) fine-tuning curves for the synthetically generated datasets as well as the non-synthetic training set.
  • Figure 4: We compare the BoolQ performance of Gemma 2B fine-tuned on data generated by $\mathrm{PT}$ and $\mathrm{SS}_\mathrm{NP}$ as the number of generated examples increases.
  • Figure 5: Performance of the $\mathrm{PT}$ method with and without diversification, i.e., with and without the instruction to generate "10 different questions".
  • ...and 8 more figures
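
The training workflow described in the Figure 1 caption (frozen encoder, trainable map from z to a soft prompt, frozen LLM, next-word-prediction reconstruction loss) can be illustrated with a toy NumPy sketch. This is not the authors' implementation. The "LLM" here is a stand-in that mean-pools the soft prompt and adds the previous token's embedding, the encoder is a random frozen table, the prompt map is a single linear layer, and all names and sizes are hypothetical. The only point is to show which parameters receive gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, d_z, t = 20, 16, 8, 4           # toy vocab size, LM width, context width, prompt length

# Frozen stand-ins (never updated): toy "LLM" weights and a toy sequence encoder.
E = 0.3 * rng.normal(size=(V, d))      # input token embeddings of the toy LM
U = 0.3 * rng.normal(size=(V, d))      # output projection of the toy LM
ENC = rng.normal(size=(V, d_z))        # token table of the toy encoder

def encode(x):
    """Frozen encoder: token sequence x -> dense context vector z."""
    return np.tanh(ENC[x].mean(axis=0))

def soft_prompt(A, z):
    """Trainable map P_theta(z): a linear layer reshaped into t prompt vectors."""
    return (A @ z).reshape(t, d)

def loss_and_grad(A, x):
    """Mean next-token loss of x under the frozen toy LM, and its gradient w.r.t. A."""
    z = encode(x)
    p_bar = soft_prompt(A, z).mean(axis=0)           # toy LM reads the prompt via mean pooling
    prev = np.vstack([np.zeros(d), E[x[:-1]]])       # previous-token embedding (zeros at position 0)
    logits = (p_bar + prev) @ U.T                    # shape (len(x), V)
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    n = len(x)
    loss = -np.log(probs[np.arange(n), x]).mean()
    dlogits = probs.copy()
    dlogits[np.arange(n), x] -= 1.0                  # softmax cross-entropy gradient
    g_pbar = (dlogits @ U).mean(axis=0)              # dL/d p_bar, shape (d,)
    g_prompt = np.tile(g_pbar / t, (t, 1))           # each prompt row contributes 1/t to p_bar
    g_A = np.outer(g_prompt.reshape(-1), z)          # chain rule through the linear map
    return loss, g_A

# Stand-in for samples from the target distribution.
data = [rng.integers(0, V, size=6) for _ in range(8)]

A = np.zeros((t * d, d_z))                           # the only trainable parameters
losses = []
for step in range(200):
    total, grad = 0.0, np.zeros_like(A)
    for x in data:
        l, g = loss_and_grad(A, x)
        total += l / len(data)
        grad += g / len(data)
    losses.append(total)
    A -= 0.1 * grad                                  # full-batch gradient step on A only
```

As in the caption, `E`, `U`, and `ENC` stay fixed throughout and only `A` is updated, so the sequence-specific information needed to lower the reconstruction loss has to flow through z and the learned soft prompt.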