Balancing Synthetic Data and Replay for Enhancing Task-Specific Capabilities

Urs Spiegelhalter, Jörg K. H. Franke, Frank Hutter

TL;DR

The paper tackles the problem of adapting language models to new tasks without catastrophic forgetting under finite compute by balancing synthetic data augmentation and replay. It introduces a systematic, token-budget–driven study using the bAbI reasoning tasks and a 1.7B base model to quantify how total token budget and replay ratio affect task mastery and general knowledge retention. Key findings show that 5%–10% replay suffices to preserve general knowledge, performance plateaus around a total budget of $10^{8.5}$ tokens, and synthetic data diversity (bAbI-Synthetic) markedly improves task mastery compared with original data; these insights lead to practical guidelines for replay selection and budgeting. The results enable efficient task-specific adaptation in resource-constrained settings and offer a roadmap for extending continual-learning analyses to other tasks and synthetic-generation methods.
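To make the two quantities in this summary concrete, the sketch below splits a total token budget into task tokens and replay tokens for a given replay fraction. It is only an illustration of the bookkeeping, not the authors' training code: the function name `split_token_budget` and the rounding choice are assumptions, and only the budget of $10^{8.5}$ tokens and the 5%–10% replay range come from the text above.

```python
def split_token_budget(total_tokens: int, replay_fraction: float) -> tuple[int, int]:
    """Split a total token budget into (task_tokens, replay_tokens).

    Hypothetical helper for illustration; the rounding rule is an assumption.
    """
    if not 0.0 <= replay_fraction <= 1.0:
        raise ValueError("replay_fraction must lie in [0, 1]")
    # Replay tokens come from the general-purpose corpus (DCLM-Edu in the paper).
    replay_tokens = round(total_tokens * replay_fraction)
    # Whatever remains of the budget is spent on the task data (bAbI).
    task_tokens = total_tokens - replay_tokens
    return task_tokens, replay_tokens


# Budget at which the summary reports performance plateauing.
plateau_budget = round(10 ** 8.5)

for fraction in (0.05, 0.10):
    task, replay = split_token_budget(plateau_budget, fraction)
    print(f"replay={fraction:.0%}: task={task:,} tokens, replay={replay:,} tokens")
```

Under these assumptions, a replay fraction of 1.0 assigns the whole budget to the replay corpus, which matches the "100% replay" setting described in the Figure 1 caption.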

Abstract

Adapting language models to new tasks through continued pretraining faces a fundamental trade-off: models must learn new capabilities while avoiding catastrophic forgetting of existing knowledge. While prior work has studied synthetic data generation techniques, the optimal replay ratios for balancing task performance and knowledge retention under computational constraints remain poorly understood. We present a comprehensive empirical study investigating the interplay between replay ratio configuration and computational budget when adapting language models to new tasks. Using the bAbI reasoning tasks as our target objective, we apply synthetic data generation and systematically evaluate different total token budgets and replay ratio configurations. We analyze their effects on both task mastery and general knowledge retention. Our experiments reveal an optimal configuration that balances task-specific performance with general knowledge retention. Based on our findings, we provide empirically grounded guidelines for selecting replay ratios based on computational budget, enabling practitioners to achieve strong task adaptation with significantly reduced training costs.

Paper Structure

This paper contains 46 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: Full analysis of the interplay between different total token budgets and replay percentages. The left column shows results for bAbI-Original and the right column shows results for bAbI-Synthetic. $100\%$ replay means training exclusively on DCLM-Edu up to that total token budget. The combined scores indicate the most valuable configurations for task mastery and general knowledge retention.
  • Figure 2: $3 \times 3$ grid search over learning rate and batch size. We used 100k bAbI-Synthetic samples with $50\%$ replay.
  • Figure 3: Number of epochs used for bAbI-Original and number of samples per task used for bAbI-Synthetic for each configuration.
  • Figure 4: GSM8K-CoT (8-shot) performance utilizing the DCLM-Edu replay dataset across replay percentages of $20\%$, $40\%$, $60\%$, and $80\%$.
  • Figure 5: GSM8K-CoT (8-shot) performance utilizing a replay dataset comprising $85\%$ DCLM-Edu and $15\%$ AugGSM8K across total replay percentages of $5\%$, $10\%$, $15\%$, $20\%$, and $25\%$.
  • ...and 1 more figure