Contrastive Decoding for Synthetic Data Generation in Low-Resource Language Modeling

Jannek Ulm, Kevin Du, Vésteinn Snæbjarnarson

TL;DR

This work probes whether contrastive decoding (CD) can be repurposed to generate high-signal synthetic data for pretraining language models under tight data budgets. By contrasting a GOOD and a BAD model trained on the same base corpus, CD biases synthetic text toward informative continuations. Mixed with real data, this text improves both the language modeling objective and downstream tasks, especially reasoning and entity tracking. Vanilla (non-contrastive) sampling remains strongest for perplexity and core grammatical benchmarks, while CD excels on tasks requiring multi-step inference and world knowledge, suggesting a practical division of labor for synthetic-data generation. The findings point to the potential of CD for data-efficient pretraining, with limitations in scale, compute, diversity, and safety left for future work.

Abstract

Large language models (LLMs) are trained on huge amounts of textual data, and concerns have been raised that the limits of such data may soon be reached. A potential solution is to train on synthetic data sampled from LLMs. In this work, we build on this idea and investigate the benefits of contrastive decoding for generating synthetic corpora. In a controlled setting, we experiment with sampling corpora using the relative difference between a good and a bad model trained on the same original corpus of 100 million words. By amplifying the signal from the better-performing model, we create a synthetic corpus and mix it with the original training data. Our findings show that training on a mixture of synthesized and real data improves performance on the language modeling objective and a range of downstream tasks. In particular, we see that training with a mix of synthetic data from contrastive decoding benefits tasks that require more reasoning skills, while synthetic data from traditional sampling helps more on tasks dependent on surface-level linguistic capabilities.
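
To make "sampling using the relative difference between a good and a bad model" concrete, the following is a minimal sketch of one common contrastive decoding formulation: score each token by the gap between the two models' log-probabilities, restricted to tokens the good model itself finds plausible. The exact scoring rule used in the paper is not given in this summary, so the plausibility threshold `alpha`, the function names, and the toy logits are all assumptions for illustration.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))


def contrastive_next_token_distribution(good_logits, bad_logits, alpha=0.1):
    """Return a sampling distribution that amplifies GOOD relative to BAD.

    Assumed formulation (not confirmed by the source):
      1. keep only tokens whose GOOD probability is at least
         alpha * (max GOOD probability);
      2. score the kept tokens by log p_GOOD - log p_BAD;
      3. renormalize the scores with a softmax.
    """
    good_logp = log_softmax(np.asarray(good_logits, dtype=float))
    bad_logp = log_softmax(np.asarray(bad_logits, dtype=float))

    # Plausibility mask: drop tokens the GOOD model considers very unlikely,
    # so a tiny BAD probability cannot promote an implausible token.
    plausible = good_logp >= np.log(alpha) + good_logp.max()

    scores = np.where(plausible, good_logp - bad_logp, -np.inf)
    scores = scores - scores[plausible].max()  # stabilize the softmax
    probs = np.exp(scores)                     # exp(-inf) == 0 for masked tokens
    return probs / probs.sum()


def sample_token(good_logits, bad_logits, rng, alpha=0.1):
    """Draw one token id from the contrastive distribution."""
    probs = contrastive_next_token_distribution(good_logits, bad_logits, alpha)
    return int(rng.choice(len(probs), p=probs))


# Toy vocabulary of three tokens (hypothetical logits).
toy_good = [2.0, 1.9, -5.0]
toy_bad = [2.0, 0.0, 3.0]
toy_probs = contrastive_next_token_distribution(toy_good, toy_bad)
toy_token = sample_token(toy_good, toy_bad, np.random.default_rng(0))
```

In the toy example, token 1 is nearly as likely as token 0 under the good model but much less likely under the bad model, so the contrastive distribution favors it. Token 2 is masked out because the good model assigns it negligible probability, even though the bad model rates it highly.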

Paper Structure

This paper contains 57 sections, 7 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Our synthetic data generation and training pipeline: Start by training baseline LMs on a "real" corpus (TinyBabyLM: human-written text + TinyStories). The $\mathsf{GOOD}$ model is the best checkpoint; the $\mathsf{BAD}$ model is a weaker variant, e.g., an earlier checkpoint. We generate synthetic corpora via (i) contrastive decoding (CD), and (ii) non-contrastive ancestral (vanilla) sampling. We then train new models on a mixture of the original and synthetic corpora. We find that contrastive models improve the most over the Baseline in evaluations on reasoning-oriented benchmarks, such as entity tracking.
  • Figure 2: Top-$k$ and top-$p$ truncation under ancestral decoding. "Vanilla" denotes ancestral sampling from unmodified logits after $\mathsf{CD}$ or No-Contrast. On downstream tasks, $k{=}200$ is the strongest setting; perplexity exhibits no single optimum. Full results are in Table \ref{tab:AllModels}.
  • Figure 3: Mixing ratio ablation for CD-generated synthetic corpora (CD-Early-500); see also Table \ref{tab:AllModels}. The ratio indicates the fraction of synthetic data in training batches. $\mu_{\Delta \mathrm{REL}}$ is the mean relative improvement over Baseline across non-perplexity tasks; Perplexity shows the relative change vs. Baseline. A 30% mix yields the best overall $\mu_{\Delta \mathrm{REL}}$ (+4.90%), while 40% attains the lowest perplexity (23.42).
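
The two quantities in the Figure 3 caption can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes the mixing ratio fixes the number of synthetic documents per training batch, and that $\mu_{\Delta \mathrm{REL}}$ is the plain average of per-task relative changes for higher-is-better scores. The function names, task names, and scores are hypothetical.

```python
import random


def mixed_batch(real_docs, synthetic_docs, batch_size, synthetic_ratio, rng):
    """Build one training batch with a fixed fraction of synthetic documents.

    Assumption: the ratio is enforced per batch by sampling without
    replacement from each pool, then shuffling the combined batch.
    """
    n_synthetic = round(batch_size * synthetic_ratio)
    n_real = batch_size - n_synthetic
    batch = rng.sample(synthetic_docs, n_synthetic) + rng.sample(real_docs, n_real)
    rng.shuffle(batch)
    return batch


def mean_relative_improvement(baseline_scores, model_scores):
    """Mean relative change over Baseline across tasks, in percent.

    Assumption: every task score is higher-is-better, so perplexity is
    excluded, matching the caption's "non-perplexity tasks".
    """
    relative = [
        (model_scores[task] - baseline_scores[task]) / baseline_scores[task]
        for task in baseline_scores
    ]
    return 100.0 * sum(relative) / len(relative)


# Hypothetical pools of documents, tagged by origin.
real_pool = [("real", i) for i in range(100)]
synthetic_pool = [("synthetic", i) for i in range(100)]

# A 30% mix, as in the best-performing setting reported in the caption.
example_batch = mixed_batch(real_pool, synthetic_pool, 10, 0.3, random.Random(0))

# Hypothetical scores for two tasks (not taken from the paper).
example_mu = mean_relative_improvement(
    {"task_a": 50.0, "task_b": 80.0},
    {"task_a": 55.0, "task_b": 80.0},
)
```

With a batch size of 10 and a ratio of 0.3, each batch holds 3 synthetic and 7 real documents. In the hypothetical scores, one task improves by 10% and the other is unchanged, giving a mean relative improvement of 5%.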