Context-Free Synthetic Data Mitigates Forgetting
Parikshit Bansal, Sujay Sanghavi
TL;DR
Context-free synthetic data (CFS) mitigates catastrophic forgetting during fine-tuning of foundation models by using unconditional samples from the original model $p_{\theta^*}$ to approximate the KL divergence $KL(p_{\theta^*}\|p_\theta)$. The method optimizes a two-term objective, $\min_\theta \; E_{(x,y)\sim F}[-\log p_\theta(y|x)] + \lambda\, E_{x\sim p_{\theta^*}}[-\log p_\theta(x)]$, where the first term is the fine-tuning loss on the data $F$ and the context-free samples $x\sim p_{\theta^*}$ in the second term are obtained by prompting the original model with only the bos token. The authors validate CFS in two settings (pretrained-only OLMo-1B fine-tuned on MetaMathQA for GSM8K, and a reasoning setup with R1-Distill-Llama-8B on MedReason) and show improved retention of pretraining and reasoning abilities over baselines such as LoRA, $\ell_2$, and Wise-FT. The results indicate a practical, data-oblivious strategy for stabilizing model behavior during domain adaptation and reasoning fine-tuning, with implications for open-weight models whose training data are unavailable.
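The link between the second term of the objective above and the KL divergence can be made explicit with a short derivation. This is a sketch, assuming the context-free samples $x_1,\dots,x_N$ are exact draws from $p_{\theta^*}$; the sample count $N$ and the entropy symbol $H$ are notation introduced here for illustration, not taken from the paper.

$$
\begin{aligned}
KL(p_{\theta^*}\|p_\theta)
&= E_{x\sim p_{\theta^*}}\left[\log p_{\theta^*}(x) - \log p_\theta(x)\right] \\
&= -H(p_{\theta^*}) + E_{x\sim p_{\theta^*}}\left[-\log p_\theta(x)\right] \\
&\approx -H(p_{\theta^*}) + \frac{1}{N}\sum_{i=1}^{N} -\log p_\theta(x_i)
\end{aligned}
$$

The entropy $H(p_{\theta^*})$ does not depend on $\theta$, so minimizing the negative log-likelihood of the context-free samples under $p_\theta$ minimizes a sample estimate of the KL divergence up to a constant.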
Abstract
Fine-tuning a language model often degrades its existing performance on other tasks, due to a shift in the model parameters; this phenomenon is often referred to as (catastrophic) forgetting. We are interested in mitigating this in settings where we have access only to the model weights, not to its training data or recipe. A natural approach is to penalize the KL divergence between the original model and the new one. Our main realization is that a simple process, which we term context-free generation, allows for an approximately unbiased estimate of this KL divergence. We show that augmenting a fine-tuning dataset with context-free generations mitigates forgetting in two settings: (a) preserving the zero-shot performance of pretrained-only models, and (b) preserving the reasoning performance of thinking models. We show that contextual synthetic data, and even a portion of the pretraining data, are less effective. We also investigate the effect of choices such as generation temperature and data ratios. We present our results on OLMo-1B for the pretrained-only setting and R1-Distill-Llama-8B for the reasoning setting.
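The idea of estimating a KL divergence from samples of the original model alone can be illustrated on a toy problem. The sketch below is not the authors' implementation: it replaces the language models with two small categorical distributions, and all function and variable names (`kl_exact`, `kl_monte_carlo`, `nll_monte_carlo`, `p_original`, `p_finetuned`) are hypothetical.

```python
import math
import random


def kl_exact(p, q):
    """Exact KL(p || q) for two categorical distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_monte_carlo(p, q, num_samples, rng):
    """Estimate KL(p || q) using only samples drawn from p.

    Each sample x contributes log p(x) - log q(x); the average of these
    terms is an unbiased estimate of the KL divergence.
    """
    samples = rng.choices(range(len(p)), weights=p, k=num_samples)
    return sum(math.log(p[x]) - math.log(q[x]) for x in samples) / num_samples


def nll_monte_carlo(p, q, num_samples, rng):
    """Estimate E_{x~p}[-log q(x)], the only part of the KL that depends on q."""
    samples = rng.choices(range(len(p)), weights=p, k=num_samples)
    return -sum(math.log(q[x]) for x in samples) / num_samples


if __name__ == "__main__":
    rng = random.Random(0)
    p_original = [0.5, 0.3, 0.2]   # toy stand-in for the original model
    p_finetuned = [0.4, 0.4, 0.2]  # toy stand-in for the fine-tuned model
    print("exact KL:   ", kl_exact(p_original, p_finetuned))
    print("sampled KL: ", kl_monte_carlo(p_original, p_finetuned, 100_000, rng))
    print("sampled NLL:", nll_monte_carlo(p_original, p_finetuned, 100_000, rng))
```

In the language-model setting, each sample would instead be a full generation started from the bos token, and `q[x]` would be the fine-tuned model's likelihood of that sequence. The sampled negative log-likelihood differs from the KL only by the entropy of the original distribution, which is why training on such samples can act as a KL penalty.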
