Reusing Pre-Training Data at Test Time is a Compute Multiplier
Alex Fang, Thomas Voice, Ruoming Pang, Ludwig Schmidt, Tom Gunter
TL;DR
This paper probes whether the knowledge embedded in open pre-training datasets can be extracted at test time through retrieval and further exploited with additional test-time compute. Using retrieval-augmented generation over public corpora together with a test-time compute protocol, it shows notable gains on MMLU, Math-500, and SimpleQA, with an average compute multiplier of about $5\times$ that declines as models scale, and further improvements when extra test-time compute is spent parsing the retrieved context. Decontamination largely preserves the gains on MMLU while revealing some contamination in Math-500, underscoring the need for held-out evaluation sets. The findings suggest that current pre-training underutilizes available data and that retrieval plus test-time computation can unlock substantial additional performance, with implications for dataset construction and efficient deployment of LLMs.
Abstract
Large language models learn from their vast pre-training corpora, gaining the ability to solve an ever-increasing variety of tasks; yet although researchers work to improve these datasets, there is little effort to understand how efficient the pre-training apparatus is at extracting ideas and knowledge from the data. In this work, we use retrieval-augmented generation along with test-time compute as a way to quantify how much dataset value was left behind by the process of pre-training, and how this changes across scale. We demonstrate that pre-training and then retrieving from standard and largely open-sourced datasets results in significant accuracy gains on MMLU, Math-500, and SimpleQA, which persist through decontamination. For MMLU we observe that retrieval acts as a ~5x compute multiplier versus pre-training alone. We show that these results can be further improved by leveraging additional compute at test time to parse the retrieved context, demonstrating a 10 percentage point improvement on MMLU for the public LLaMA 3.1 8B model. Overall, our results suggest that today's pre-training methods do not make full use of the information in existing pre-training datasets, leaving significant room for progress.
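To make the notion of a "compute multiplier" concrete, the sketch below shows one common way such a number can be derived; it is an illustration under stated assumptions, not necessarily the procedure used in this paper. It assumes baseline accuracy is roughly linear in the logarithm of pre-training compute, fits that line, and then asks how much baseline compute would be needed to match the accuracy of a retrieval-augmented model. The function names and all numbers are hypothetical and synthetic.

```python
import math


def fit_log_linear(compute, accuracy):
    """Least-squares fit of accuracy = a + b * log10(compute).

    Assumption: accuracy is approximately log-linear in compute
    over the fitted range.
    """
    xs = [math.log10(c) for c in compute]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracy) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracy))
    var = sum((x - mean_x) ** 2 for x in xs)
    b = cov / var
    a = mean_y - b * mean_x
    return a, b


def compute_multiplier(a, b, compute_spent, accuracy_with_retrieval):
    """Baseline compute needed to reach the retrieval accuracy,
    divided by the compute actually spent on the retrieval model."""
    equivalent_log_compute = (accuracy_with_retrieval - a) / b
    return 10 ** equivalent_log_compute / compute_spent


# Synthetic baseline curve (made-up values, not results from the paper).
baseline_compute = [1e20, 1e21, 1e22]
baseline_accuracy = [0.50, 0.60, 0.70]
a, b = fit_log_linear(baseline_compute, baseline_accuracy)

# Synthetic retrieval-augmented model trained with 1e21 FLOPs, constructed
# so that it matches a baseline trained with five times as much compute.
retrieval_accuracy = 0.60 + 0.10 * math.log10(5)
multiplier = compute_multiplier(a, b, 1e21, retrieval_accuracy)
print(f"compute multiplier: {multiplier:.2f}x")
```

Under this reading, a multiplier of 5 means that the baseline would need about five times the pre-training compute to reach the same accuracy without retrieval. A multiplier that shrinks with scale would show up as the retrieval and baseline curves drawing closer together at higher compute.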
