ResearchEnvBench: Benchmarking Agents on Environment Synthesis for Research Code Execution

Yubang Wang, Chenxi Zhang, Bowen Chen, Zezheng Huai, Zihao Dai, Xinchi Chen, Yuxin Wang, Yining Zheng, Jingjing Gong, Xipeng Qiu

TL;DR

ResearchEnvBench is a benchmark for environment synthesis in research code execution. It reveals a substantial gap in current SOTA agents, whose failures are dominated by incomplete dependency resolution and brittle version coupling.

Abstract

Autonomous agents are increasingly expected to support scientific research, and recent benchmarks report progress in code repair and autonomous experimentation. However, these evaluations typically assume a pre-configured execution environment. Building that environment requires resolving complex software dependencies, aligning hardware and framework versions, and configuring distributed execution, yet this capability remains largely unbenchmarked. We introduce ResearchEnvBench, a benchmark for environment synthesis in research code execution. Given a research repository, documentation, and a target execution setting, agents must construct an environment in which the repository's code executes successfully at runtime. Evaluations on diverse research repositories reveal a substantial gap in current SOTA agents, with failures dominated by incomplete dependency resolution and brittle version coupling. ResearchEnvBench provides a realistic testbed for advancing autonomous agents toward reproducible scientific research.

Paper Structure (37 sections, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Agentic closed-loop resolution of research environments under implicit constraints. Starting from a raw research repository (left), an LLM-driven agent iteratively observes execution signals, diagnoses missing dependencies and compatibility constraints, acts by installing/pinning packages or switching toolchains, and verifies by re-running probes (center), aiming to reach an executable research-ready state (right). The multi-layer dependency stack highlights that compatibility constraints propagate downward (Python $\rightarrow$ system $\rightarrow$ CUDA $\rightarrow$ driver/hardware), while failures typically surface upward as logs/tracebacks. Evaluation reports importability ($C_0$), runtime CPU/GPU checks ($C_1$--$C_4$), and an auditable, reproducible report and logs ($C_5$).
  • Figure 2: Repository composition of ResearchEnvBench. We curate 44 post-2024 ML research repositories spanning eight categories: GenVis (generative vision), Depth (depth estimation), Audio (audio & speech), LLM-Inf (LLM inference & acceleration), TrainEng (training & engineering frameworks), VisMM (vision/multimodal foundations), DocAI (document AI / OCR / translation), and AppsEval (applications & evaluation). Slice sizes denote the fraction of repositories per category by count, and each slice shows a representative flagship repository from that category.
  • Figure 3: The Pyramid of Runtime Verification. We structure environment synthesis into a hierarchy of increasing complexity, ranging from static dependency resolution ($C_0$) to multi-GPU distributed training ($C_4$). The right panel illustrates our auxiliary metrics: Efficiency (Time, Token, Size) and Capability Hallucination ($C_5$), which quantifies the discrepancy between the agent's self-reported success (Report) and the ground-truth execution (Real Result).
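
The captions above describe an importability check ($C_0$) and a comparison between the agent's self-reported success and the ground-truth execution result ($C_5$). The following is a minimal sketch of those two ideas, not the benchmark's actual harness: the function names `probe_imports` and `hallucinated_checks`, the dictionary layout, and the rule "claimed passed but observed failed" are all assumptions made for illustration.

```python
import importlib


def probe_imports(modules):
    """Try to import each module name and record whether it succeeded.

    Hypothetical stand-in for a C0-style importability probe.
    """
    results = {}
    for name in modules:
        try:
            importlib.import_module(name)
            results[name] = True
        except Exception:
            # Any failure (missing package, broken native library, ...)
            # counts as "not importable" in this sketch.
            results[name] = False
    return results


def hallucinated_checks(reported, observed):
    """Return the checks the agent reported as passing that actually failed.

    `reported` and `observed` map check names (e.g. "C0") to booleans.
    A check missing from `observed` is treated as failed (an assumption).
    """
    return sorted(
        check
        for check, claimed in reported.items()
        if claimed and not observed.get(check, False)
    )


if __name__ == "__main__":
    # "no_such_module_zzz_12345" is a made-up name that should not exist.
    imports = probe_imports(["json", "no_such_module_zzz_12345"])
    observed = {"C0": all(imports.values()), "C1": False}
    reported = {"C0": True, "C1": False}  # hypothetical agent self-report
    print(hallucinated_checks(reported, observed))
```

In a realistic setting each probe would likely run in a fresh subprocess inside the synthesized environment, so that a failed import cannot affect later checks; the in-process version here only keeps the sketch short.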