InnovatorBench: Evaluating Agents' Ability to Conduct Innovative LLM Research

Yunze Wu, Dayuan Fu, Weiye Si, Zhen Huang, Mohan Jiang, Keyu Li, Shijie Xia, Jie Sun, Tianze Xu, Xiangkun Hu, Pengrui Lu, Xiaojie Cai, Lyumanshan Ye, Wenhong Zhu, Yang Xiao, Pengfei Liu

TL;DR

InnovatorBench is a first end-to-end benchmark for AI agents conducting LLM research, spanning 20 tasks across data construction, filtering, augmentation, loss design, reward design, and scaffold construction. It is paired with ResearchGym, a scalable environment that supports long-horizon, multi-machine experiments with a rich action space, so that autonomous AI researchers built as ReAct-style agents can be evaluated realistically. Empirical results across frontier models show promising capabilities on data-centric tasks but brittleness in algorithm design and long-horizon decision making; hints help in some domains and may hurt in others. Because agents need extended runtimes and nuanced reasoning to do well, the results highlight the need for better tool use, resource management, and creative problem-solving in end-to-end LLM research agents. Overall, the work provides a foundation for a next generation of code-based research benchmarks and evaluation platforms that better reflect real scientific workflows.

Abstract

AI agents could accelerate scientific discovery by automating hypothesis formation, experiment design, coding, execution, and analysis, yet existing benchmarks probe narrow skills in simplified settings. To address this gap, we introduce InnovatorBench, a benchmark-platform pair for realistic, end-to-end assessment of agents performing Large Language Model (LLM) research. It comprises 20 tasks spanning Data Construction, Filtering, Augmentation, Loss Design, Reward Design, and Scaffold Construction, which require runnable artifacts and assessment of correctness, performance, output quality, and uncertainty. To support agent operation, we develop ResearchGym, a research environment offering rich action spaces, distributed and long-horizon execution, asynchronous monitoring, and snapshot saving. We also implement a lightweight ReAct agent that couples explicit reasoning with executable planning using frontier models such as Claude-4, GPT-5, GLM-4.5, and Kimi-K2. Our experiments demonstrate that while frontier models show promise in code-driven research tasks, they struggle with fragile algorithm-related tasks and with long-horizon decision making, showing failure modes such as impatience, poor resource management, and overreliance on template-based reasoning. Furthermore, agents require over 11 hours to achieve their best performance on InnovatorBench, underscoring the benchmark's difficulty and its potential to serve as a next-generation code-based research benchmark.

Paper Structure

This paper contains 34 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Overview of InnovatorBench and ResearchGym. InnovatorBench consists of 20 LLM research tasks from 6 research domains. Each task takes the most powerful agent up to 36 hours to complete. ResearchGym provides the infrastructure support and a rich action space for the agent to work in InnovatorBench.
  • Figure 2: An illustrative LLM research task from DAPO (yu2025dapo). (a) Datasets. The agent receives a task description and a starter workspace; an optional hint is revealed only when the agent explicitly requests it via the view_hint tool, at the cost of a penalty on the final score. (b) Evaluations. An evaluation directory includes evaluation scripts and reference data, and evaluation is performed externally with them. The agent submits its output via the eval tool and receives a score with feedback, which prevents score hacking. The full example is in the appendix (extended InnovatorBench examples).
  • Figure 3: Overall relationship between InnovatorBench, ResearchGym, and agents. ResearchGym's workspace is initialized with the InnovatorBench dataset. The agent receives a task description, reasons over observations, and executes actions on a target computer. The agent iterates this process, optionally using view_hint for hints and eval for submitting answers, until it calls finish. ResearchGym then performs a final evaluation and saves a state snapshot.
  • Figure 4: Four representative cases of agents' actual failures. (a) Impatience, (b) Resource Mismanagement, (c) Selection of Suboptimal Library, (d) Template-based Reasoning. Some spaces have been removed in the figure.
  • Figure 5: Test-time scaling: InnovatorBench vs. PaperBench (starace2025paperbench). The PaperBench results are taken from the original paper. Agents require about $6.5\times$ longer test time to reach the saturation point on InnovatorBench, which indicates that the benchmark’s difficulty stems from the need for extended runtime before performance plateaus. DC, DF, DA, LD, RD, and SC denote the six task domains in InnovatorBench: Data Construction, Data Filtering, Data Augmentation, Loss Design, Reward Design, and Scaffold Construction, respectively.
  • ...and 1 more figures
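
The interaction loop described in the captions of Figures 2 and 3 can be illustrated with a small toy sketch. This is not the ResearchGym implementation: only the tool names view_hint, eval, and finish come from the captions. The ToyEnv class, the write action, the scripted policy, the scorer, and the penalty value of 0.1 are all hypothetical and chosen only to make the control flow concrete.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Assumed value for illustration; the source only says that a penalty applies.
HINT_PENALTY = 0.1


@dataclass
class ToyEnv:
    """Hypothetical stand-in for a research environment (not ResearchGym)."""
    task: str
    hint: str
    scorer: Callable[[str], float]  # external evaluation, hidden from the agent
    workspace: str = ""
    hint_used: bool = False
    done: bool = False
    history: List[Tuple[str, str]] = field(default_factory=list)

    def score(self) -> float:
        # Score the workspace externally, then subtract the hint penalty if used.
        raw = self.scorer(self.workspace)
        penalty = HINT_PENALTY if self.hint_used else 0.0
        return max(0.0, raw - penalty)

    def step(self, action: str, arg: str = "") -> str:
        # Execute one action and return the observation shown to the agent.
        if action == "write":  # hypothetical action: overwrite the workspace
            self.workspace = arg
            obs = "workspace updated"
        elif action == "view_hint":
            self.hint_used = True
            obs = self.hint
        elif action == "eval":
            obs = f"score={self.score():.2f}"
        elif action == "finish":
            self.done = True
            obs = "finished"
        else:
            obs = f"unknown action: {action}"
        self.history.append((action, obs))
        return obs

    def snapshot(self) -> Dict[str, object]:
        # Minimal state snapshot taken after the final evaluation.
        return {"workspace": self.workspace, "hint_used": self.hint_used,
                "steps": len(self.history)}


def scripted_policy() -> Callable[[str], Tuple[str, str]]:
    # A fixed plan standing in for the model's reasoning step.
    plan = iter([("write", "draft"), ("eval", ""), ("view_hint", ""),
                 ("write", "final answer"), ("eval", ""), ("finish", "")])
    return lambda observation: next(plan)


def run_episode(env: ToyEnv, policy: Callable[[str], Tuple[str, str]],
                max_steps: int = 20) -> float:
    # Observe, choose an action, act; stop when the agent calls finish.
    obs = env.task
    for _ in range(max_steps):
        action, arg = policy(obs)
        obs = env.step(action, arg)
        if env.done:
            break
    return env.score()  # final evaluation, run outside the agent's control


if __name__ == "__main__":
    env = ToyEnv(task="toy task", hint="toy hint",
                 scorer=lambda ws: 1.0 if ws == "final answer" else 0.5)
    print(run_episode(env, scripted_policy()), env.snapshot())
```

In this sketch the agent only ever sees the string returned by eval, never the scorer itself, which mirrors the external evaluation described for Figure 2. A real agent would replace the scripted policy with a model call that reasons over the observation before choosing an action.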