Reusing Pre-Training Data at Test Time is a Compute Multiplier

Alex Fang, Thomas Voice, Ruoming Pang, Ludwig Schmidt, Tom Gunter

TL;DR

This paper probes whether the knowledge embedded in open pre-training datasets can be extracted at test time by retrieval and augmented with additional test-time compute. Using retrieval-augmented generation across public corpora and a test-time compute protocol, it shows notable gains on MMLU, Math-500, and SimpleQA, with an average compute multiplier of about $5\times$ that declines as models scale, and further improvements when additional test-time compute is spent parsing the retrieved context. Decontamination largely preserves gains on MMLU while revealing some contamination in Math-500, underscoring the need for held-out evaluation sets. The findings suggest that current pre-training underutilizes available data and that retrieval plus test-time computation can unlock substantial additional performance, with implications for dataset construction and efficient deployment of LLMs.

Abstract

Large language models learn from their vast pre-training corpora, gaining the ability to solve an ever increasing variety of tasks; yet although researchers work to improve these datasets, there is little effort to understand how efficient the pre-training apparatus is at extracting ideas and knowledge from the data. In this work, we use retrieval augmented generation along with test-time compute as a way to quantify how much dataset value was left behind by the process of pre-training, and how this changes across scale. We demonstrate that pre-training then retrieving from standard and largely open-sourced datasets results in significant accuracy gains in MMLU, Math-500, and SimpleQA, which persist through decontamination. For MMLU we observe that retrieval acts as a ~5x compute multiplier versus pre-training alone. We show that these results can be further improved by leveraging additional compute at test time to parse the retrieved context, demonstrating a 10 percentage point improvement on MMLU for the public LLaMA 3.1 8B model. Overall, our results suggest that today's pre-training methods do not make full use of the information in existing pre-training datasets, leaving significant room for progress.
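One way to read the "compute multiplier" claim is as the factor by which pre-training compute would have to grow for the retrieval-free model to match the accuracy of the retrieval-augmented one. The sketch below illustrates that reading only; it is not the paper's measurement procedure. The function name `compute_multiplier`, the log-linear interpolation between baseline points, and every number in the usage example are assumptions made up for illustration.

```python
import math


def compute_multiplier(baseline_curve, compute, accuracy_with_retrieval):
    """Estimate how much more pre-training compute the baseline would need.

    baseline_curve: list of (compute, accuracy) pairs for the model without
        retrieval, sorted by compute, with strictly increasing accuracy
        (hypothetical input format).
    compute: pre-training compute of the retrieval-augmented model.
    accuracy_with_retrieval: accuracy that model reaches with retrieval.
    """
    for (c0, a0), (c1, a1) in zip(baseline_curve, baseline_curve[1:]):
        if a1 > a0 and a0 <= accuracy_with_retrieval <= a1:
            # Assumption: accuracy is linear in log-compute between points.
            t = (accuracy_with_retrieval - a0) / (a1 - a0)
            log_c = math.log(c0) + t * (math.log(c1) - math.log(c0))
            return math.exp(log_c) / compute
    raise ValueError("target accuracy lies outside the baseline curve")


# Made-up numbers, not results from the paper.
curve = [(1e20, 0.40), (1e21, 0.50), (1e22, 0.60)]
multiplier = compute_multiplier(curve, compute=1e20, accuracy_with_retrieval=0.45)
```

Under this reading, a multiplier that shrinks with scale means the retrieval-augmented accuracy sits closer to what the baseline already reaches at the same compute.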

Paper Structure

This paper contains 25 sections, 4 figures, 17 tables.

Figures (4)

  • Figure 1: Retrieval on the pre-training dataset can substantially improve upon the performance of the base model. However, the exact benefit depends on the type of task.
  • Figure 2: MMLU breakdown by category, showing the impact of adding retrieval across compute budgets. Retrieval provides a strong lift, and the difference between retrieving from a random subset of the datastore and the full set is small and diminishes with scale.
  • Figure 3: For SimpleQA, our retrieval system is fairly robust to scaling the retrieval datastore, even if the new data does not contain useful information. Our custom Wikipedia datastore contains 22B tokens; adding DCLM data helps a little, or, when the datastore also starts with additional golden link data, hurts only a little.
  • Figure 4: Inter-document consistency can be used to analyze retrieval and consistency. We apply self-consistency to generations that each retrieve from an individual document, and select the answer from the most self-consistent document.
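
The selection rule described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `select_by_document_consistency`, the dictionary input format, and the use of the majority-answer fraction as the consistency score are all hypothetical, and the sampled answers in the usage example are invented.

```python
from collections import Counter


def select_by_document_consistency(answers_per_doc):
    """Pick the answer from the most self-consistent document.

    answers_per_doc: dict mapping a document id to the list of answers
        sampled while conditioning on that document alone
        (hypothetical input format).
    Returns (doc_id, answer, agreement), where agreement is the fraction
    of that document's samples that match its majority answer.
    """
    best = None
    for doc_id, answers in answers_per_doc.items():
        if not answers:
            continue  # a document with no samples cannot be scored
        answer, count = Counter(answers).most_common(1)[0]
        agreement = count / len(answers)
        if best is None or agreement > best[2]:
            best = (doc_id, answer, agreement)
    if best is None:
        raise ValueError("no sampled answers to select from")
    return best


# Invented samples: four generations per retrieved document.
samples = {
    "doc_a": ["B", "B", "C", "B"],
    "doc_b": ["A", "A", "A", "A"],
    "doc_c": ["D", "C", "B", "A"],
}
chosen = select_by_document_consistency(samples)
```

Scoring each document separately, rather than voting across all samples at once, lets a single highly consistent document outweigh several noisy ones.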