SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents

Yujiong Shen; Yajie Yang; Zhiheng Xi; Binze Hu; Huayu Sha; Jiazheng Zhang; Qiyuan Peng; Junlin Shang; Jixuan Huang; Yutao Fan; Jingqi Tong; Shihan Dou; Ming Zhang; Lei Bai; Zhenfei Yin; Tao Gui; Xingjun Ma; Qi Zhang; Xuanjing Huang; Yu-Gang Jiang

TL;DR

SciAgentGym is a scalable interactive environment featuring 1,780 domain-specific tools across four natural science disciplines, supported by a robust execution infrastructure. It is accompanied by SciForge, a data synthesis method that models the tool action space as a dependency graph to generate logic-aware training trajectories.

Abstract

Scientific reasoning inherently demands integrating sophisticated toolkits to navigate domain-specific knowledge. Yet, current benchmarks largely overlook agents' ability to orchestrate tools for such rigorous workflows. To bridge this gap, we introduce SciAgentGym, a scalable interactive environment featuring 1,780 domain-specific tools across four natural science disciplines, supported by a robust execution infrastructure. Complementing this, we present SciAgentBench, a tiered evaluation suite designed to stress-test agentic capabilities from elementary actions to long-horizon workflows. Our evaluation identifies a critical bottleneck: state-of-the-art models struggle with complex scientific tool-use. Even for a leading model like GPT-5, success rates drop sharply from 60.6% to 30.9% as interaction horizons extend, primarily due to failures in multi-step workflow execution. To address this, we propose SciForge, a data synthesis method that models the tool action space as a dependency graph to generate logic-aware training trajectories. By fine-tuning on these trajectories, our SciAgent-8B outperforms the significantly larger Qwen3-VL-235B-Instruct while exhibiting positive cross-domain transfer of scientific tool-use capabilities. These results underscore the promising potential of next-generation autonomous scientific agents.
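
To make the dependency-graph idea behind SciForge more concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not say how edges are defined or how programs are sampled, so the sketch assumes that tool B depends on tool A whenever A's output type appears among B's input types, and that a candidate tool sequence is drawn by a random walk over those edges. The `Tool` class, the four tool names, and the type labels are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool signature: named input types and one output type."""
    name: str
    inputs: Tuple[str, ...]
    output: str


def build_dependency_graph(tools: List[Tool]) -> Dict[str, List[str]]:
    """Assumed edge rule: a -> b whenever a's output type is an input type of b."""
    return {
        a.name: [b.name for b in tools if b.name != a.name and a.output in b.inputs]
        for a in tools
    }


def sample_chain(graph: Dict[str, List[str]], start: str,
                 max_len: int, rng: random.Random) -> List[str]:
    """Random walk over dependency edges, never revisiting a tool."""
    chain = [start]
    while len(chain) < max_len:
        successors = [t for t in graph[chain[-1]] if t not in chain]
        if not successors:
            break  # dead end: no tool can consume the current output
        chain.append(rng.choice(successors))
    return chain


# Hypothetical chemistry-flavoured tools, invented for illustration only.
TOOLS = [
    Tool("parse_smiles", ("smiles",), "molecule"),
    Tool("compute_molar_mass", ("molecule",), "mass"),
    Tool("moles_from_mass", ("mass",), "moles"),
    Tool("optimize_geometry", ("molecule",), "geometry"),
]

GRAPH = build_dependency_graph(TOOLS)
CHAIN = sample_chain(GRAPH, "parse_smiles", max_len=3, rng=random.Random(0))
print(CHAIN)
```

Every consecutive pair in the sampled chain is type-compatible by construction, which is one plausible reading of "logic-aware". Executing such a chain in the environment and generating a question from the resulting trace are separate steps that this sketch does not cover.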

Paper Structure (62 sections, 12 equations, 13 figures, 14 tables, 1 algorithm)

Introduction
Related Work
SciAgentGym
Tool Design.
Closed-loop Interaction.
SciAgentBench
Dataset Statistics.
Evaluation Metrics.
SciForge: Execution-Grounded Synthesis
Tool Dependency Graph and Program Sampling
Forward Execution with Environment
Trace-to-Question Generation
Experiments
Experimental Setup
Models.
...and 47 more sections

Figures (13)

Figure 1: Benchmarking multi-step scientific tool-use in SciAgentGym. A representative trajectory where an LLM agent interacts with the environment to solve a complex chemistry task. This example illustrates the core agent capabilities demonstrated within our environment: orchestrating domain-specific tools, recovering from errors, and synthesizing final outputs.
Figure 2: Overview of SciAgentGym. The left panel depicts the foundational Environment, integrating specialized toolkits and sandboxed infrastructure to handle multi-disciplinary multi-modal tasks. The right panel illustrates the core capabilities supported by this environment, including SciAgentBench for evaluation, an interactive Agent Interface for iterative multi-step reasoning with feedback, and a Training method to enhance performance via tool graph sampling and execution-grounded verification.
Figure 3: t-SNE visualization of tool embeddings by subdomain. Tools are colored by subdomain, illustrating the semantic diversity and broad coverage of the SciAgentGym toolkit.
Figure 4: Failure analysis in SciAgentBench. (Left) Tool-call frequency vs. Success Rate; the negative correlation ($r=-0.18$) implies ineffective loops. (Middle) Feedback metrics: Adaptation (responding to errors), Tuning (parameter refinement), Switching (strategy pivoting), and Loop Escape ($1 -$ identical repetition rate). (Right) Error recovery rates over step intervals.
Figure 5: Scaling behavior: Tool-augmented (ReAct) performance improves with data size, while tool-free reasoning saturates early, highlighting the value of large-scale tool-use trajectories.
...and 8 more figures
