InnovatorBench: Evaluating Agents' Ability to Conduct Innovative LLM Research
Yunze Wu, Dayuan Fu, Weiye Si, Zhen Huang, Mohan Jiang, Keyu Li, Shijie Xia, Jie Sun, Tianze Xu, Xiangkun Hu, Pengrui Lu, Xiaojie Cai, Lyumanshan Ye, Wenhong Zhu, Yang Xiao, Pengfei Liu
TL;DR
InnovatorBench is a first end-to-end benchmark for AI agents conducting LLM research, spanning 20 tasks across data construction, filtering, augmentation, loss design, reward design, and scaffold construction. It is paired with ResearchGym, a scalable environment that supports long-horizon, multi-machine experiments with a rich action space, so that autonomous AI researchers can be evaluated realistically using ReAct-style agents. Empirical results across frontier models show promising capabilities on data-centric tasks but reveal brittleness in algorithm design and long-horizon decision making; hints improve results in some domains and may hinder them in others. Agents need more than 11 hours to reach their best performance, so the benchmark demands far longer runtimes and more nuanced reasoning than existing benchmarks, and it highlights the need for better tool use, resource management, and creative problem-solving in end-to-end LLM research agents. Overall, the work provides a foundation for a next generation of code-based research benchmarks and evaluation platforms that better reflect real scientific workflows.
Abstract
AI agents could accelerate scientific discovery by automating hypothesis formation, experiment design, coding, execution, and analysis, yet existing benchmarks probe narrow skills in simplified settings. To address this gap, we introduce InnovatorBench, a benchmark-platform pair for realistic, end-to-end assessment of agents performing Large Language Model (LLM) research. It comprises 20 tasks spanning Data Construction, Filtering, Augmentation, Loss Design, Reward Design, and Scaffold Construction, each of which requires runnable artifacts and is assessed on correctness, performance, output quality, and uncertainty. To support agent operation, we develop ResearchGym, a research environment offering rich action spaces, distributed and long-horizon execution, asynchronous monitoring, and snapshot saving. We also implement a lightweight ReAct agent that couples explicit reasoning with executable planning using frontier models such as Claude-4, GPT-5, GLM-4.5, and Kimi-K2. Our experiments demonstrate that while frontier models show promise in code-driven research tasks, they struggle with fragile algorithm-related tasks and with long-horizon decision making, where they exhibit impatience, poor resource management, and overreliance on template-based reasoning. Furthermore, agents require over 11 hours to achieve their best performance on InnovatorBench, underscoring the benchmark's difficulty and its potential to serve as a next-generation code-based research benchmark.
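To make the ReAct pattern mentioned above concrete, the following is a minimal sketch of a reason-act-observe loop. It is not the paper's agent or the ResearchGym API: `ToyEnv`, `scripted_policy`, `react_loop`, and the `write_file`/`read_file`/`finish` action names are all hypothetical, and the scripted policy stands in for what would be a call to a frontier model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ToyEnv:
    """Hypothetical stand-in for a research environment (not ResearchGym)."""
    files: Dict[str, str] = field(default_factory=dict)

    def step(self, action: str, arg: str) -> str:
        # Execute one action and return a textual observation.
        if action == "write_file":
            name, _, content = arg.partition("::")
            self.files[name] = content
            return f"wrote {name}"
        if action == "read_file":
            return self.files.get(arg, "file not found")
        return f"unknown action: {action}"


# A policy maps the interaction history to (thought, action, argument).
Policy = Callable[[List[str]], Tuple[str, str, str]]


def scripted_policy(history: List[str]) -> Tuple[str, str, str]:
    """Placeholder for an LLM call: follows a fixed three-step plan."""
    n_obs = sum(1 for h in history if h.startswith("Observation:"))
    plan = [
        ("I should save the filtering rule first.", "write_file", "rule.txt::keep len>10"),
        ("Let me verify what was written.", "read_file", "rule.txt"),
        ("The artifact exists, so I can stop.", "finish", "done"),
    ]
    return plan[min(n_obs, len(plan) - 1)]


def react_loop(policy: Policy, env: ToyEnv, max_steps: int = 10) -> List[str]:
    """Alternate explicit reasoning and actions until 'finish' or the step budget."""
    history: List[str] = []
    for _ in range(max_steps):
        thought, action, arg = policy(history)
        history.append(f"Thought: {thought}")
        history.append(f"Action: {action}({arg})")
        if action == "finish":
            break
        # Feed the environment's response back into the history.
        history.append(f"Observation: {env.step(action, arg)}")
    return history


env = ToyEnv()
trace = react_loop(scripted_policy, env)
```

In this sketch the step budget (`max_steps`) is the only guard against a policy that never emits `finish`; a long-horizon setting like the one described above would also need to handle asynchronous jobs and resource limits, which are omitted here.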
