
To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining

Karan Singh, Michael Yu, Varun Gangal, Zhuofu Tao, Sachin Kumar, Emmy Liu, Steven Y. Feng

Abstract

Retrieval-augmented generation (RAG) improves language model (LM) performance by providing relevant context at test time for knowledge-intensive situations. However, the relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade-off between pretraining corpus size and retrieval store size across a wide range of model and data scales. We train OLMo-2-based LMs ranging from 30M to 3B parameters on up to 100B tokens of DCLM data, while varying both pretraining data scale (1-150x the number of parameters) and retrieval store size (1-20x), and evaluate performance across a diverse suite of benchmarks spanning reasoning, scientific QA, and open-domain QA. We find that retrieval consistently improves performance over parametric-only baselines across model scales and introduce a three-dimensional scaling framework that models performance as a function of model size, pretraining tokens, and retrieval corpus size. This scaling manifold enables us to estimate optimal allocations of a fixed data budget between pretraining and retrieval, revealing that the marginal utility of retrieval depends strongly on model scale, task type, and the degree of pretraining saturation. Our results provide a quantitative foundation for understanding when and how retrieval should complement pretraining, offering practical guidance for allocating data resources in the design of scalable language modeling systems.

Paper Structure

This paper contains 35 sections, 6 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Trade-off between pretraining and retrieval under a fixed data budget. Left: We train OLMo-2 models ranging from 30M to 3B parameters on DCLM data while constructing retrieval stores from held-out portions of the same corpus. Center: We conceptualize this as an optimization problem over a 2D allocation space of pretraining and retrieval tokens. For a fixed data budget, feasible configurations lie along a constraint frontier, and performance varies smoothly; our goal is to identify the optimal allocation along this frontier (a minimal allocation-search sketch follows this figure list). Right: Retrieval allocation trade-off at fixed pretraining scale. As the % of data used for retrieval increases, performance changes non-monotonically, with scale dependence: smaller models benefit most, while larger models exhibit diminishing returns and over-allocation sensitivity.
  • Figure 2: Parametric scaling baselines without RAG ($R=0$). Left: Empirical measurements across model sizes and data budgets, overlaid with iso-loss contours from the power-law model. Each point corresponds to a trained model configuration, colored by observed perplexity. The blue line denotes the compute-efficient frontier, and the vertical dashed lines mark discrete training budgets. Right: Iso-compute slices of the scaling surface, showing predicted loss as a function of model size ($N$). Empirical observations are overlaid for reference.
  • Figure 3: Trade-off between pretraining and retrieval under a fixed data budget. Left: We quantify the substitutability between retrieval and pretraining via the number of pretraining tokens saved per retrieval token, computed by fitting scaling laws and determining, for each pretraining scale, the amount of retrieval required to match baseline performance without retrieval. The dotted line shows a linear best fit across all model scales. Right: We measure the marginal benefit of retrieval as perplexity improvement per billion retrieval tokens (higher is better) for models trained near their optimal pretraining ratio.
  • Figure 4: Effect of retrieval query formulation on performance. Comparison of standard generation on SimpleQA without retrieval (Baseline) to RAG under two query formulations: (i) RAG (Query), which retrieves top-$k$ passages using only the question, and (ii) RAG (Query + Gold), which additionally includes the gold answer in the query (an oracle-style ablation). SimpleQA is not multiple-choice (no answer choices), so we do not report RAG (Query + Choices) here. All methods use a shared corpus index constructed from 20% of the data, retrieving the top-5 passages per query. Left: OLMo-2 136M. Right: OLMo-2 1B.
  • Figure 5: Calibration plots for 3D scaling law fits across benchmarks. We show the alignment between predicted and observed $PPL$ as a function of $N$, $D$, and $R$ across six benchmarks. The tight grouping around the diagonal indicates that the standard Hoffmann power-law formulation, extended with a logarithmic term for retrieval, effectively captures the retrieval-augmented scaling behavior (a minimal fitting sketch follows this figure list).
  • ...and 3 more figures
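
To make the fitted surface described in Figure 5 concrete, the sketch below fits a Hoffmann-style power law in $N$ and $D$ with an added logarithmic retrieval term, as the caption describes. The exact form of the retrieval term, the token normalization, the synthetic data, and all names (`ppl_surface`, the coefficient values) are illustrative assumptions for exposition; they are not the paper's released code or fitted coefficients.

```python
# Minimal sketch (not the paper's code): fit an assumed 3D scaling surface
#   PPL(N, D, R) = E + A/N^alpha + B/D^beta - gamma * log(1 + R/1e9),
# i.e. a Hoffmann-style power law in model size N and pretraining tokens D,
# plus an assumed logarithmic term in retrieval-store tokens R.
import numpy as np
from scipy.optimize import curve_fit

def ppl_surface(X, E, A, alpha, B, beta, gamma):
    """Predicted perplexity for model size N, pretraining tokens D,
    and retrieval-store tokens R (all raw counts)."""
    N, D, R = X
    return E + A / N**alpha + B / D**beta - gamma * np.log1p(R / 1e9)

# Synthetic observations on a small grid of configurations; a real fit would
# use measured perplexities from the trained models.
Ns = np.array([3.0e7, 1.36e8, 1.0e9, 3.0e9])      # model sizes
Ds = np.array([3.0e9, 1.0e10, 5.0e10, 1.0e11])    # pretraining tokens
Rs = np.array([0.0, 2.0e9, 1.0e10, 2.0e10])       # retrieval-store tokens
N, D, R = (g.ravel() for g in np.meshgrid(Ns, Ds, Rs, indexing="ij"))

rng = np.random.default_rng(0)
true = (4.0, 4.0e2, 0.32, 6.0e2, 0.28, 0.5)       # placeholder "ground truth"
ppl = ppl_surface((N, D, R), *true) + rng.normal(0.0, 0.05, size=N.size)

params, _ = curve_fit(
    ppl_surface, (N, D, R), ppl,
    p0=[5.0, 1.0e2, 0.3, 1.0e3, 0.3, 0.3],
    maxfev=100_000,
)
print(dict(zip(["E", "A", "alpha", "B", "beta", "gamma"], params.round(3))))
```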
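
Figure 1 (center and right) frames the fixed-budget question as a search along the constraint frontier $D + R = T$. Reusing `ppl_surface` and the fitted `params` from the fitting sketch, a minimal allocation search could look like the following; the budget, model sizes, and the `best_allocation` helper are illustrative assumptions, not the paper's procedure or numbers.

```python
# Sketch of a fixed-budget allocation search: split a total data budget of
# T tokens between pretraining (D) and the retrieval store (R = T - D), and
# pick the split that minimizes predicted perplexity under the fitted surface.
def best_allocation(n_params, total_tokens, fitted,
                    fractions=np.linspace(0.0, 0.95, 96)):
    D = (1.0 - fractions) * total_tokens    # tokens used for pretraining
    R = fractions * total_tokens            # tokens reserved for the retrieval store
    N = np.full_like(D, n_params)
    preds = ppl_surface((N, D, R), *fitted)
    i = int(np.argmin(preds))
    return fractions[i], preds[i]

for n in (1.36e8, 1.0e9, 3.0e9):            # illustrative model sizes
    frac, ppl_hat = best_allocation(n, total_tokens=5.0e10, fitted=params)
    print(f"N={n:.2e}: put {frac:.0%} of the budget in the retrieval store "
          f"(predicted PPL {ppl_hat:.2f})")
```

Under these assumed coefficients the preferred split shifts with model scale; the paper's actual allocations come from its fitted surfaces and benchmark-specific fits, not from this toy example.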