The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements
Bingchen Zhao, Despoina Magka, Minqi Jiang, Xian Li, Roberta Raileanu, Tatiana Shavrina, Jean-Christophe Gagnon-Audet, Kelvin Niu, Shagun Sodhani, Michael Shvartsman, Andrei Lupu, Alisia Lupidi, Edan Toledo, Karen Hambardzumyan, Martin Josifoski, Thomas Foster, Lucia Cipolina-Kun, Abhishek Charnalia, Derek Dunfield, Alexander H. Miller, Oisin Mac Aodha, Jakob Foerster, Yoram Bachrach
TL;DR
The paper introduces the Automated LLM Speedrunning Benchmark, which assesses AI agents' ability to reproduce sequential scientific improvements using the NanoGPT speedrun as a testbed. Each task asks an agent to advance from one ground-truth record to the next, optionally aided by multi-level hints, and the Fraction of Speedup Recovered (FSR) metric measures whether the agent recovers or exceeds the known speedup under a fixed hardware setup. Across multiple LLMs and search scaffolds, the study finds that even strong reasoning models struggle to reproduce ground-truth speedups, especially without hints, and that performance degrades on later records. The benchmark thus offers a simple yet challenging, non-saturated measure of progress toward autonomous scientific exploration: a reproducible framework for evaluating agents across chained research advances in a frontier ML task, which can guide the development of more capable AI research agents while making the limits of current models explicit.
Abstract
Rapid advancements in large language models (LLMs) have the potential to assist in scientific progress. A critical capability toward this endeavor is the ability to reproduce existing work. To evaluate the ability of AI agents to reproduce results in an active research area, we introduce the Automated LLM Speedrunning Benchmark, leveraging the research community's contributions to the NanoGPT speedrun, a competition to train a GPT-2 model in the shortest time. Each of the 19 speedrun tasks provides the agent with the previous record's training script, optionally paired with one of three hint formats, ranging from pseudocode to paper-like descriptions of the new record's improvements. Records execute quickly by design, and speedrun improvements encompass diverse code-level changes, ranging from high-level algorithmic advancements to hardware-aware optimizations. These features make the benchmark both accessible and realistic for the frontier problem of improving LLM training. We find that recent reasoning LLMs combined with SoTA scaffolds struggle to reimplement already-known innovations in our benchmark, even when given detailed hints. Our benchmark thus provides a simple, non-saturated measure of an LLM's ability to automate scientific reproduction, a necessary (but not sufficient) skill for an autonomous research agent.
