Can LLMs Write Mathematics Papers? A Case Study in Reservoir Computing
Allen G Hart
TL;DR
The study investigates whether frontier LLMs can perform mathematics research by prompting four models to produce a research-style mini-paper on reservoir computing. A three-stage scaffolding workflow prompts the models to derive a limit expression, implement a Lorenz-system experiment with Takens-like structure, and assemble a LaTeX manuscript, with evaluation across mathematics, implementation, and writing quality. Results show that while the models generate coherent, technically structured content and runnable code, they exhibit surface-level understanding and occasional misalignments with the literature, including fabricated references in early drafts. The findings suggest that current LLMs approach human-level capability on integrated math-research-style tasks under time-limited conditions and that scaling predictions may extend to mathematics, though rigorous evaluation and safeguards are needed to ensure precision and true research judgment.
Abstract
AI capabilities on economically relevant human expert tasks continue to grow exponentially, with task completion horizons doubling every 7 months according to Model Evaluation and Threat Research (METR). We are interested in how this trend applies to the task of mathematics research. To explore this, we evaluated the capability of four frontier large language models (LLMs), ChatGPT 5, Claude 4.1 Opus, Gemini 2.5 Pro, and Grok 4, at the task of creating a mini-paper on reservoir computing. All models produced engaging papers with some apparent understanding of various techniques, but were sometimes led to mistakes by a surface-level understanding of key ideas. That said, the capabilities of the LLMs on this task were likely as good as or greater than those predicted by METR.
