The Agentic Researcher: A Practical Guide to AI-Assisted Research in Mathematics and Machine Learning

Max Zimmer, Nico Pelleriti, Christophe Roux, Sebastian Pokutta

Abstract

AI tools and agents are reshaping how researchers work, from proving theorems to training neural networks. Yet for many, it remains unclear how these tools fit into everyday research practice. This paper is a practical guide to AI-assisted research in mathematics and machine learning: We discuss how researchers can use modern AI systems productively, where these systems help most, and what kinds of guardrails are needed to use them responsibly. It is organized into three parts: (I) a five-level taxonomy of AI integration, (II) an open-source framework that, through a set of methodological rules formulated as agent prompts, turns CLI coding agents (e.g., Claude Code, Codex CLI, OpenCode) into autonomous research assistants, and (III) case studies from deep learning and mathematics. The framework runs inside a sandboxed container, works with any frontier LLM through existing CLI agents, is simple enough to install and use within minutes, and scales from personal-laptop prototyping to multi-node, multi-GPU experimentation across compute clusters. In practice, our longest autonomous session ran for over 20 hours, dispatching independent experiments across multiple nodes without human intervention. We stress that our framework is not intended to replace the researcher in the loop, but to augment them. Our code is publicly available at https://github.com/ZIB-IOL/The-Agentic-Researcher.

Paper Structure (75 sections, 10 figures, 1 table)

Figures (10)

  • Figure 1: A command-line interface (CLI) agent during an autonomous research session: over 8 hours in, managing six parallel GPU training runs and three scheduled monitoring tasks. The same framework supports mathematical derivations, proofs, and verification alongside computational experiments. The agent is idle, consuming no tokens while waiting for a status check to complete.
  • Figure 2: Setting up a research project. Top: the three categories of input the researcher provides, with their conceptual role (dark) and concrete realization (light). Bottom: two examples from our case studies: a deep learning project (\ref{sec:case_a}) and a mathematics project (\ref{sec:case_d}).
  • Figure 3: Overview of the agentic research framework. Top: The researcher writes a persistent instruction file that governs the CLI agent operating within a sandboxed environment. Bottom: Each experiment follows an eight-step loop.
  • Figure 4: Final validation perplexity [sic] from the agent's report in \ref{sec:case_a}. Lower is better. The dashed line marks the Muon baseline; the agent's modifications achieve ${\sim}5\%$ improvement over Muon and ${\sim}8\%$ over AdamW.
  • Figure 5: Training curves [sic] from the agent's report in \ref{sec:case_a}. Left: full training run. Right: final 3,000 iterations (zoomed). The agent's optimizer modifications consistently outperform both Muon and AdamW baselines throughout training, not only in the final iterations. Note that here, the agent named the new method NewMuon, which is inconsistent with the naming in \ref{fig:case_a_ppl}.
  • ...and 5 more figures