PaperArena: An Evaluation Benchmark for Tool-Augmented Agentic Reasoning on Scientific Literature

Daoyu Wang, Mingyue Cheng, Shuo Yu, Zirui Liu, Ze Guo, Qi Liu

TL;DR

PaperArena introduces a challenging benchmark and open-source evaluation platform (PaperArena-Hub) for tool-augmented agentic reasoning on scientific literature. It constructs 784 multi-step, multimodal QA pairs across 100 AI papers, requiring cross-document integration and planning over a modular toolset. Experimental results reveal a substantial gap between state-of-the-art LLM-powered agents and human experts, with inefficiencies in tool usage and planning, even in multi-agent setups. The work provides insights into failure modes and offers a scalable framework for advancing intelligent, tool-enabled scientific discovery.

Abstract

Understanding and reasoning over web-scale scientific literature is a crucial touchstone for large language model (LLM) based agents designed to support complex knowledge-intensive tasks. However, existing work is mainly restricted to tool-free tasks within isolated papers, largely due to the lack of a benchmark for cross-paper reasoning and multi-tool orchestration in real research scenarios. In this work, we propose PaperArena, an evaluation benchmark for agents to address real-world research questions that typically require integrating information across multiple papers with the assistance of external tools. Given a research question, agents should integrate diverse formats across multiple papers through reasoning and interacting with appropriate tools, thereby producing a well-grounded answer. To support standardized evaluation, we provide a modular and extensible platform for agent execution, offering tools such as multimodal parsing, context retrieval, and programmatic computation. Experimental results reveal that even the most advanced LLM powering a well-established agent system achieves merely 38.78% average accuracy. On the hard subset, accuracy drops to only 18.47%, highlighting great potential for improvement. We also present several empirical findings, including that all agents tested exhibit inefficient tool usage, often invoking more tools than necessary to solve a task. We invite the community to adopt PaperArena to develop and evaluate more capable agents for scientific discovery. Our code and data are available at https://github.com/Melmaphother/PaperArena.

Paper Structure

This paper contains 51 sections, 6 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Confronted with vast literature, researchers must orchestrate a diverse toolset for cross-paper reasoning to answer complex scientific questions.
  • Figure 2: The effect of our sampling strategy in creating a more representative and comprehensive paper subset.
  • Figure 3: Key features of PaperArena, highlighting four core capabilities required for scientific reasoning agents.
  • Figure 4: Illustration of the PaperArena benchmark construction and PaperArena-Hub platform details, featuring the tool-centric QA generation pipeline and the evaluation platform including single-agent and multi-agent systems.
  • Figure 5: Comparison of the theoretical and practical number of tool calls by Gemini 2.5 Pro on the single-agent system, showing that practical calls far exceed the theoretical number.
  • ...and 4 more figures
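
The two quantities discussed above, average accuracy and the gap between theoretical and practical tool calls, can be illustrated with a minimal sketch. This is not the PaperArena-Hub implementation: the `Episode` record, its field names, and the `call_overhead` ratio are hypothetical names chosen for illustration, and the sample episodes are made-up toy values, not results from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """One agent attempt at a question (hypothetical record layout)."""
    correct: bool        # whether the final answer was judged correct
    minimal_calls: int   # assumed: fewest tool calls that could solve the task
    actual_calls: int    # tool calls the agent actually made


def average_accuracy(episodes: Sequence[Episode]) -> float:
    """Percentage of episodes answered correctly."""
    if not episodes:
        raise ValueError("need at least one episode")
    return 100.0 * sum(e.correct for e in episodes) / len(episodes)


def call_overhead(episodes: Sequence[Episode]) -> float:
    """Ratio of actual to minimal tool calls; 1.0 means no wasted calls."""
    minimal = sum(e.minimal_calls for e in episodes)
    if minimal == 0:
        raise ValueError("minimal call count must be positive")
    return sum(e.actual_calls for e in episodes) / minimal


# Toy data for illustration only.
episodes = [
    Episode(correct=True, minimal_calls=2, actual_calls=5),
    Episode(correct=False, minimal_calls=3, actual_calls=9),
    Episode(correct=True, minimal_calls=1, actual_calls=1),
    Episode(correct=False, minimal_calls=4, actual_calls=5),
]

print(f"accuracy: {average_accuracy(episodes):.2f}%")
print(f"call overhead: {call_overhead(episodes):.2f}x")
```

An overhead above 1.0 in this sketch corresponds to the pattern the abstract describes, where agents invoke more tools than a task requires.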