ResearchGym: Evaluating Language Model Agents on Real-World AI Research

Aniketh Garikaparthi, Manasi Patwardhan, Arman Cohan

TL;DR

ResearchGym establishes a reusable, contamination-aware benchmark and execution environment to evaluate LLM agents on end-to-end closed-loop research tasks. By reusing five recent papers and preserving datasets and baselines while withholding authors’ methods, it enables objective, execution-based grading with single-GPU feasibility and standardized integrity checks. Across 15 end-to-end runs with GPT-5, the frontier agent exhibits a clear capability–reliability gap: it occasionally surpasses the baselines or even the state of the art on select tasks, but most runs fall short on task completion and consistent improvement. The paper also analyzes failure modes, resource-efficiency dynamics, and the impact of scaffolding, providing a credible foundation for advancing autonomous research while highlighting risks of overconfidence and reward hacking. Overall, ResearchGym offers a principled platform for measuring, diagnosing, and accelerating autonomous AI research, with open-source tooling and a clear path for extension to more modalities and tasks.

Abstract

We introduce ResearchGym, a benchmark and execution environment for evaluating AI agents on end-to-end research. To instantiate this, we repurpose five oral and spotlight papers from ICML, ICLR, and ACL. From each paper's repository, we preserve the datasets, evaluation harness, and baseline implementations but withhold the paper's proposed method. This results in five containerized task environments comprising 39 sub-tasks in total. Within each environment, agents must propose novel hypotheses, run experiments, and attempt to surpass strong human baselines on the paper's metrics. In a controlled evaluation of an agent powered by GPT-5, we observe a sharp capability–reliability gap. The agent improves over the provided baselines from the repository in just 1 of 15 evaluations (6.7%), where the improvement is 11.5%, and completes only 26.5% of sub-tasks on average. We identify recurring long-horizon failure modes, including impatience, poor time and resource management, overconfidence in weak hypotheses, difficulty coordinating parallel experiments, and hard limits from context length. Yet in a single run, the agent surpasses the solution of an ICML 2025 Spotlight task, indicating that frontier agents can occasionally reach state-of-the-art performance, but do so unreliably. We additionally evaluate proprietary agent scaffolds, including Claude Code (Opus-4.5) and Codex (GPT-5.2), which display a similar gap. ResearchGym provides infrastructure for systematic evaluation and analysis of autonomous agents on closed-loop research.

Paper Structure (112 sections, 4 equations, 9 figures, 26 tables)

This paper contains 112 sections, 4 equations, 9 figures, 26 tables.

Figures (9)

  • Figure 1: ResearchGym combines the aspects of ideation and experimentation, evaluating LLM agents in executable research codebases with objective scores. rg-agent (w/ GPT-5): (A) Best@3 normalized performance, averaged over all primary sub-tasks; the shaded region represents a 95% confidence interval generated via percentile bootstrapping. (B) depicts the number of sub-tasks completed. (C) shows mean normalized performance over all primary sub-tasks. Error bars represent the min–max range (3 runs). Metrics are defined in the evaluation metrics section.
  • Figure 2: Benchmark Construction Pipeline: LLMs are used to generate compact task cards from award-winning papers. After two-stage filtering, each paper's repository is manually cleaned and finalized into a benchmark task. Benchmark: Consists of 5 curated tasks and 39 sub-tasks across diverse domains; a sub-task typically validates the proposed method under a different dataset or setting.
  • Figure 3: Performance vs. Resources: Plots depict the relationship between best performance and consumed resources across all tasks. The overall trend shows a weak but positive correlation between the two, with diminishing returns.
  • Figure 4: Performance vs. Tool Usage: rg-agent: (A) Illustrates efficient allocation of reasoning tokens to respective tools, (B) Demonstrates a natural exploration/exploitation pattern over time in the tool usage mix, (C) Establishes a moderate negative correlation between action density and performance, and (D) Further shows the diminishing increase in performance with resources.
  • Figure 5: cl Stats.
  • ...and 4 more figures
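
The Figure 1 caption mentions Best@3 scores with a 95% confidence interval obtained by percentile bootstrapping. The following is a minimal sketch of how such an interval could be computed. It is not the authors' implementation: the function names, the choice to resample over per-sub-task Best@3 values, and all numbers are assumptions made for illustration only.

```python
import random
from statistics import mean


def best_at_k(run_scores):
    """Return the highest normalized score among k independent runs."""
    return max(run_scores)


def percentile_bootstrap_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`.

    Resamples `values` with replacement n_boot times, computes the mean of
    each resample, and returns the alpha/2 and 1 - alpha/2 percentiles.
    """
    rng = random.Random(seed)
    n = len(values)
    boot_means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = int((1 - alpha / 2) * n_boot) - 1
    return boot_means[lo_idx], boot_means[hi_idx]


# Hypothetical normalized scores (3 runs per sub-task); NOT data from the paper.
scores_by_subtask = {
    "subtask_a": [0.10, 0.40, 0.25],
    "subtask_b": [0.90, 0.70, 0.85],
    "subtask_c": [0.00, 0.30, 0.20],
    "subtask_d": [1.05, 0.60, 0.50],
    "subtask_e": [0.60, 0.55, 0.10],
}

best_scores = [best_at_k(runs) for runs in scores_by_subtask.values()]
point_estimate = mean(best_scores)
ci_low, ci_high = percentile_bootstrap_ci(best_scores)
print(f"Best@3 mean = {point_estimate:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

With only three runs per sub-task, the interval mostly reflects variation across sub-tasks rather than across runs, which is one reason the caption reports a separate min–max range for the per-run means.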