Context-Free Synthetic Data Mitigates Forgetting

Parikshit Bansal, Sujay Sanghavi

TL;DR

Context-free synthetic data (CFS) mitigates catastrophic forgetting during fine-tuning of foundation models by using unconditional samples from the original model to approximate the KL divergence $KL(p_{\theta^*}\|p_\theta)$. The method minimizes a two-term objective, $\min_\theta \mathbb{E}_{(x,y)\sim F}[-\log p_\theta(y|x)] + \lambda\, \mathbb{E}_{x\sim p_{\theta^*}}[-\log p_\theta(x)]$, where the context-free samples $x\sim p_{\theta^*}$ are generated by prompting the original model with only the beginning-of-sequence (BOS) token. The authors validate CFS in two settings: a pretrained-only setting (OLMo-1B finetuned on MetaMathQA to improve GSM8K accuracy) and a reasoning setting (R1-Distill-Llama-8B finetuned on MedReason). In both, they report better retention of pretraining and reasoning abilities than baselines such as LoRA, $\ell_2$ regularization, and Wise-FT. The results point to a practical, data-oblivious strategy for stabilizing model behavior during domain adaptation and reasoning tasks, which is relevant for open-weight models whose training data are unavailable.
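
The two-term objective above can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the "model" is a toy bigram table of logits, the token ids `BOS` and `EOS`, the helper names, and the value of `lam` are all hypothetical. It only shows the mechanics: sample sequences from the base model starting from the BOS token alone, then add their negative log-likelihood under the current parameters, weighted by `lam`, to the usual fine-tuning loss.

```python
import math
import random

BOS, EOS = 0, 1  # hypothetical special-token ids for a toy 4-token vocabulary


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def next_token_probs(bigram_logits, prev_token):
    """Toy language model: the next-token distribution depends only on the previous token."""
    return softmax(bigram_logits[prev_token])


def context_free_sample(base_logits, rng, max_len=8):
    """Sample x ~ p_{theta*} by conditioning on nothing but the BOS token."""
    seq = [BOS]
    while len(seq) <= max_len:
        probs = next_token_probs(base_logits, seq[-1])
        tok = rng.choices(range(len(probs)), weights=probs)[0]
        seq.append(tok)
        if tok == EOS:
            break
    return seq


def sequence_nll(logits, seq, start=1):
    """Sum of -log p(seq[t] | seq[t-1]) over positions t >= start."""
    return -sum(
        math.log(next_token_probs(logits, seq[t - 1])[seq[t]])
        for t in range(start, len(seq))
    )


def cfs_objective(theta_logits, finetune_pairs, cf_samples, lam):
    """Fine-tuning NLL of y given x, plus lam times the NLL of context-free samples."""
    task = 0.0
    for x, y in finetune_pairs:
        full = [BOS] + x + y
        task += sequence_nll(theta_logits, full, start=1 + len(x))  # score only y
    task /= len(finetune_pairs)
    reg = sum(sequence_nll(theta_logits, s) for s in cf_samples) / len(cf_samples)
    return task + lam * reg, task, reg


rng = random.Random(0)
vocab_size = 4
# Stand-in for the original model p_{theta*}: random bigram logits.
base = [[rng.gauss(0, 1) for _ in range(vocab_size)] for _ in range(vocab_size)]
cf_samples = [context_free_sample(base, rng) for _ in range(200)]
finetune_pairs = [([2], [3, EOS]), ([3], [2, EOS])]  # made-up (x, y) pairs
total, task, reg = cfs_objective(base, finetune_pairs, cf_samples, lam=1.0)
print(f"task NLL = {task:.3f}, context-free NLL = {reg:.3f}, total = {total:.3f}")
```

In a real setting the bigram table would be replaced by the language model being finetuned, and the context-free samples would be generated once from the frozen original weights before training begins.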

Abstract

Fine-tuning a language model often results in a degradation of its existing performance on other tasks, due to a shift in the model parameters; this phenomenon is often referred to as (catastrophic) forgetting. We are interested in mitigating this in settings where we only have access to the model weights but no access to its training data/recipe. A natural approach is to penalize the KL divergence between the original model and the new one. Our main realization is that a simple process, which we term context-free generation, allows for an approximate unbiased estimation of this KL divergence. We show that augmenting a fine-tuning dataset with context-free generations mitigates forgetting in two settings: (a) preserving the zero-shot performance of pretrained-only models, and (b) preserving the reasoning performance of thinking models. We show that contextual synthetic data, and even a portion of the pretraining data, are less effective. We also investigate the effect of choices such as generation temperature and data ratios. We present our results for OLMo-1B in the pretrained-only setting and R1-Distill-Llama-8B in the reasoning setting.
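
The connection between the KL penalty and training on context-free generations can be sketched in one step. Here $p_{\theta^*}$ denotes the original model and $p_\theta$ the model being finetuned; the sample count $N$ and the samples $x_i$ are notation introduced only for this sketch. Expanding the definition of the KL divergence gives

$$
KL(p_{\theta^*}\|p_\theta) = \mathbb{E}_{x\sim p_{\theta^*}}\left[\log p_{\theta^*}(x)\right] + \mathbb{E}_{x\sim p_{\theta^*}}\left[-\log p_\theta(x)\right].
$$

The first term does not depend on $\theta$, so minimizing the KL divergence over $\theta$ is equivalent to minimizing the second term, which can be estimated by a Monte Carlo average over samples drawn from the original model:

$$
\mathbb{E}_{x\sim p_{\theta^*}}\left[-\log p_\theta(x)\right] \approx \frac{1}{N}\sum_{i=1}^{N} -\log p_\theta(x_i), \qquad x_i \sim p_{\theta^*}.
$$

This is the usual next-token training loss evaluated on the sampled sequences $x_i$, which is why adding them to the fine-tuning dataset acts as a KL penalty.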

Paper Structure

This paper contains 20 sections, 5 equations, 2 figures, 11 tables.

Figures (2)

  • Figure 1: We finetune the OLMo-1B model (Groeneveld et al., 2024) on the MetaMathQA dataset (Yu et al., 2023) with the aim of improving GSM8K accuracy while maintaining its pre-existing (i.e., pretrained) abilities (see the pretraining results subsection for details). Our method, CFS, augments the downstream data with context-free synthetic data (see the context-free generation section) and performs better than the considered baselines. Pretrain-Aug augments MetaMathQA with pretraining data, LoRA trains a low-rank adaptation, $\ell_2$ regularization regularizes the model towards its initialization, and Wise-FT does post-hoc model averaging of Finetuned and Base.
  • Figure 2: We finetune the R1-Distill-Llama-8B model (DeepSeek-AI) on the MedReason dataset (Wu et al., 2025) with the aim of improving its medical abilities while maintaining its reasoning performance (see the reasoning results subsection for details). Our method, CFS, augments the downstream data with context-free synthetic data (see the context-free generation section) and performs better than the considered baselines, namely LoRA, which trains a low-rank adaptation; $\ell_2$ regularization, which regularizes the model towards its initialization; and Wise-FT, which does post-hoc model averaging of Finetuned and Base.
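
Two of the baselines named in the captions above operate directly on the weights and can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the configuration used in the paper: parameters are represented as plain dictionaries of float lists, the function names are hypothetical, and the mixing coefficient `alpha` is an assumed knob whose value the captions do not specify.

```python
def wise_ft_interpolate(base_params, finetuned_params, alpha):
    """Post-hoc weight averaging: (1 - alpha) * base + alpha * finetuned, per parameter."""
    return {
        name: [
            (1 - alpha) * b + alpha * f
            for b, f in zip(base_params[name], finetuned_params[name])
        ]
        for name in base_params
    }


def l2_to_init_penalty(params, init_params):
    """Squared Euclidean distance between the current and the initial parameters."""
    return sum(
        (p - q) ** 2
        for name in params
        for p, q in zip(params[name], init_params[name])
    )


base = {"w": [1.0, 2.0]}        # made-up base (original) weights
finetuned = {"w": [3.0, 6.0]}   # made-up finetuned weights

averaged = wise_ft_interpolate(base, finetuned, alpha=0.5)
penalty = l2_to_init_penalty(finetuned, base)
print(averaged, penalty)
```

The interpolation is applied once after training, whereas the $\ell_2$ penalty would be added to the fine-tuning loss at every step; both only need the original weights, not the original training data.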