The Finetuner's Fallacy: When to Pretrain with Your Finetuning Data

Christina Baek, Ricardo Pio Monti, David Schwab, Amro Abbas, Rishabh Adiga, Cody Blakeney, Maximilian Böther, Paul Burstein, Aldo Gael Carranza, Alvin Deng, Parth Doshi, Vineeth Dorna, Alex Fang, Tony Jiang, Siddharth Joshi, Brett W. Larsen, Jason Chan Lee, Katherine L. Mentzer, Luke Merrick, Haakon Mongstad, Fan Pan, Anshuman Suri, Darren Teh, Jason Telanoff, Jack Urbanek, Zhengping Wang, Josh Wills, Haoli Yin, Aditi Raghunathan, J. Zico Kolter, Bogdan Gaza, Ari Morcos, Matthew Leavitt, Pratyush Maini

Abstract

Real-world model deployments demand strong performance on narrow domains where data is often scarce. Typically, practitioners finetune models to specialize them, but this risks overfitting to the domain and forgetting general knowledge. We study a simple strategy, specialized pretraining (SPT), where a small domain dataset, typically reserved for finetuning, is repeated starting from pretraining as a fraction of the total tokens. Across three specialized domains (ChemPile, MusicPile, and ProofPile), SPT improves domain performance and preserves general capabilities after finetuning compared to standard pretraining. In our experiments, SPT reduces the pretraining tokens needed to reach a given domain performance by up to 1.75x. These gains grow when the target domain is underrepresented in the pretraining corpus: on domains far from web text, a 1B SPT model outperforms a 3B standard pretrained model. Beyond these empirical gains, we derive overfitting scaling laws to guide practitioners in selecting the optimal domain-data repetition for a given pretraining compute budget. Our observations reveal the finetuner's fallacy: while finetuning may appear to be the cheapest path to domain adaptation, introducing specialized domain data during pretraining stretches its utility. SPT yields better specialized domain performance (via reduced overfitting across repeated exposures) and better general domain performance (via reduced forgetting during finetuning), ultimately achieving stronger results with fewer parameters and less total compute when amortized over inference. To get the most out of domain data, incorporate it as early in training as possible.
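As a rough illustration of how the mixing fraction relates to repetition, the number of passes over the domain dataset can be estimated from the pretraining budget, the fraction of tokens drawn from the domain, and the dataset size. The sketch below is not taken from the paper: the function name `domain_repetitions` and the uniform-mixing assumption are ours, and the example values (a 300M-token domain dataset, a 200B-token budget, a 2% mixture) are illustrative settings borrowed from the figure captions.

```python
def domain_repetitions(total_pretrain_tokens: float,
                       delta: float,
                       domain_tokens: float) -> float:
    """Estimate how many times the domain dataset is repeated.

    Assumes (our simplification) that a fraction `delta` of all
    pretraining tokens is sampled uniformly from the domain dataset.
    """
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must be a fraction in [0, 1]")
    if domain_tokens <= 0:
        raise ValueError("domain_tokens must be positive")
    # Tokens drawn from the domain, divided by the dataset size.
    return delta * total_pretrain_tokens / domain_tokens


if __name__ == "__main__":
    # Illustrative values only: 200B-token budget, 2% mixture, 300M-token dataset.
    reps = domain_repetitions(200e9, 0.02, 300e6)
    print(f"approximate repetitions: {reps:.1f}")
```

Under these assumptions, raising either the mixture fraction or the pretraining budget increases the repetition count linearly, which is why the optimal fraction depends on the compute budget.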

Paper Structure (55 sections, 5 equations, 22 figures)

Figures (22)

  • Figure 1: Specialized pretraining (SPT) mixes the finetuning dataset into pretraining as a small fraction of tokens, repeating it 10–50× over the course of training. Compared to general pretraining (dashed), SPT (solid) achieves lower domain test loss (blue) and less forgetting of general knowledge (gold) throughout finetuning. For narrow domains, these gains can overcome differences in model scale.
  • Figure 2: Specialized pretraining (SPT) outperforms finetuning-only across domains. We pretrain models with a small fraction ($\delta$) of domain-specific tokens mixed into general web data, then finetune on the domain dataset. We plot the best post-finetuning domain loss across pretraining budgets for MusicPile, ChemPile, and ProofPile (300M tokens each). Even small domain mixtures ($\delta = 1$--$5\%$, blue curves) consistently outperform pretraining on general data alone ($\delta = 0\%$, gray) at all token scales.
  • Figure 3: The finetuner's tax. Training a 1B model with specialized pretraining (SPT) costs more upfront than finetuning a 3B model on domain data alone, but the $3\times$ smaller model is cheaper to serve. The break-even point arrives after approximately 1 trillion inference tokens, after which SPT saves both compute and money while often delivering comparable or better performance.
  • Figure 4: Specialized pretraining (SPT) is more effective than scaling tokens or parameters. We compare models that include domain data during pretraining (SPT$\to$FT) against models pretrained only on general web data (NPT$\to$FT). Left: Relative gain of SPT$\to$FT over NPT$\to$FT. Center: Compute multiplier showing how much faster SPT reaches NPT's best performance. Right: Percentage of the 1B vs. 3B parameter performance gap closed by SPT. Values above 100% indicate that the 1B SPT model outperforms the 3B NPT model. SPT consistently improves model quality, training speed, and parameter efficiency (i.e., SPT is a Pareto improvement in training efficiency) across all three examined domains.
  • Figure 5: SPT reduces forgetting and improves downstream task performance. (a) For ChemPile, we plot Dolma loss (general knowledge) against domain loss for the best post-finetuning checkpoint at each pretraining budget (40B to 200B tokens) and mixture percentage $\delta$. Larger SPT mixtures achieve lower domain loss and lower general loss, indicating less catastrophic forgetting. (b) We compare NPT (gray) and 2% SPT (blue) on downstream tasks matched to each domain: MusicTheoryBench for MusicPile, ChemBench General Chemistry subset for ChemPile, and MATH for ProofPile. All tasks are evaluated in 4-choice MCQA format. For each pretraining budget, we report the best accuracy across finetuning runs. SPT outperforms NPT across most settings.
  • ...and 17 more figures
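
The break-even reasoning in the Figure 3 caption can be sketched with a simple compute model. This is a minimal sketch under our own assumptions, not the paper's cost accounting: it uses the common rule-of-thumb estimates of about 6 FLOPs per parameter per training token and about 2 FLOPs per parameter per inference token, treats the larger pretrained model as already available, and uses hypothetical function names and example values. It will not necessarily reproduce the figure's break-even point of approximately 1 trillion tokens.

```python
def train_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb training cost: ~6 FLOPs per parameter per token (assumption)."""
    return 6.0 * params * tokens


def inference_flops_per_token(params: float) -> float:
    """Rule-of-thumb inference cost: ~2 FLOPs per parameter per token (assumption)."""
    return 2.0 * params


def break_even_inference_tokens(small_params: float,
                                small_train_tokens: float,
                                large_params: float,
                                large_finetune_tokens: float) -> float:
    """Inference tokens after which the smaller model's extra training cost is repaid.

    `small_train_tokens` covers all tokens the small model is trained on
    (pretraining plus finetuning); the large model is charged only for
    finetuning, since it is assumed to be available off the shelf.
    """
    extra_upfront = (train_flops(small_params, small_train_tokens)
                     - train_flops(large_params, large_finetune_tokens))
    saving_per_token = (inference_flops_per_token(large_params)
                        - inference_flops_per_token(small_params))
    if saving_per_token <= 0:
        raise ValueError("the small model must be cheaper to serve")
    # If the small model is not more expensive upfront, it pays off immediately.
    return max(extra_upfront, 0.0) / saving_per_token


if __name__ == "__main__":
    # Hypothetical setting: 1B model trained on 200B tokens vs. a 3B model
    # finetuned for one pass over a 300M-token domain dataset.
    tokens = break_even_inference_tokens(1e9, 200e9, 3e9, 300e6)
    print(f"break-even after about {tokens:.3e} inference tokens")
```

The structure of the calculation is what matters: a one-time training premium is divided by a per-token serving saving, so the smaller model wins once deployment volume is large enough.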