ResearchArena: Benchmarking Large Language Models' Ability to Collect and Organize Information as Research Agents

Hao Kang; Chenyan Xiong

TL;DR

ResearchArena is an offline benchmark that evaluates large language models on conducting academic surveys autonomously. It decomposes the task into information discovery, information selection, and a bonus information organization step (mind-map construction). The environment, 12M full-text academic papers and 7.9K survey papers, can be rebuilt from the Semantic Scholar Open Research Corpus (S2ORC), and the benchmark introduces evaluation metrics for retrieval and structural synthesis. Baseline results show that current LLM-based approaches lag behind simple keyword-based retrieval, though recent reasoning models such as DeepSeek-R1 show slightly better zero-shot performance. The work also discusses practical considerations such as licensing and dataset updates, and it releases open-source code for constructing the benchmark to support further work on autonomous research agents.

Abstract

Large language models (LLMs) excel across many natural language processing tasks but face challenges in domain-specific, analytical tasks such as conducting research surveys. This study introduces ResearchArena, a benchmark designed to evaluate LLMs' capabilities in conducting academic surveys, a foundational step in academic research. ResearchArena models the process in three stages: (1) information discovery, identifying relevant literature; (2) information selection, evaluating papers' relevance and impact; and (3) information organization, structuring knowledge into hierarchical frameworks such as mind-maps. Notably, mind-map construction is treated as a bonus task, reflecting its supplementary role in survey-writing. To support these evaluations, we construct an offline environment of 12M full-text academic papers and 7.9K survey papers. To ensure ethical compliance, we do not redistribute copyrighted materials; instead, we provide code to construct the environment from the Semantic Scholar Open Research Corpus (S2ORC). Preliminary evaluations reveal that LLM-based approaches underperform simpler keyword-based retrieval methods, though recent reasoning models such as DeepSeek-R1 show slightly better zero-shot performance. These results underscore significant opportunities for advancing LLMs in autonomous research. We open-source the code to construct the ResearchArena benchmark at https://github.com/cxcscmu/ResearchArena.
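
To make the discovery and organization stages concrete, the following is a minimal sketch and not the benchmark's actual code. It assumes a toy in-memory corpus, a hand-rolled TF-IDF keyword scorer standing in for a "keyword-based retrieval method", recall at k as an illustrative retrieval metric, and a hypothetical mind-map schema of nested `{"name", "children"}` nodes. All function names and the schema are illustrative assumptions; the repository may define these differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_scores(query, corpus):
    """Score each document by the summed TF-IDF weight of the query terms.

    `corpus` maps a document id to its text. This scorer is an assumed
    stand-in for a keyword baseline, not the benchmark's implementation.
    """
    docs = {doc_id: Counter(tokenize(text)) for doc_id, text in corpus.items()}
    n_docs = len(docs)
    query_terms = set(tokenize(query))
    scores = {}
    for doc_id, counts in docs.items():
        score = 0.0
        for term in query_terms:
            # Document frequency: number of documents containing the term.
            df = sum(1 for c in docs.values() if term in c)
            if df:
                score += counts[term] * math.log(1 + n_docs / df)
        scores[doc_id] = score
    return scores


def discover(query, corpus, k):
    """Stage 1 (information discovery): return the top-k document ids."""
    scores = keyword_scores(query, corpus)
    # Sort by descending score; break ties by id for determinism.
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return ranked[:k]


def recall_at_k(retrieved, relevant):
    """Fraction of the relevant documents that appear in `retrieved`."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved)) / len(relevant)


def mind_map_depth(node):
    """Stage 3 (information organization): depth of a nested mind-map.

    Assumes a hypothetical schema: {"name": str, "children": [nodes]}.
    """
    children = node.get("children", [])
    return 1 + max((mind_map_depth(c) for c in children), default=0)


# Toy data for illustration only.
corpus = {
    "p1": "Transfer learning for image classification",
    "p2": "LiDAR scanning mechanisms for autonomous vehicles",
    "p3": "A study of protein folding",
}
top = discover("transfer learning", corpus, k=1)
mind_map = {
    "name": "Transfer learning",
    "children": [{"name": "Vision", "children": [{"name": "Classification", "children": []}]}],
}
```

In this sketch the survey's reference list would play the role of `relevant`, and the retrieved ids would be compared against it; the mind-map is kept as plain nested dictionaries so it serializes directly to JSON.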

Paper Structure (19 sections, 5 figures, 7 tables)

Introduction
Related Work
Collection Methodology
Survey Selection
Reference Linking
Mind-Map Extraction
Dataset Access
Analysis
Benchmark Tasks
Benchmarking
Baselines
Evaluation Results
Conclusion
Limitations
Prompts for the Dataset Collection
...and 4 more sections

Figures (5)

Figure 1: Schematic overview of the construction pipeline for ResearchArena.
Figure 2: Mind-map extraction from the figure to its JSON representation.
Figure 3: Dataset composition analysis covering disciplinary distribution, reference coverage, and mind-map complexity. Each of these aspects is critical for benchmark evaluation. Fields of study are abbreviated in the figure: Medicine (Med), Biology (Bio), Physics (Phy), Environmental Science (ES), Computer Science (CS), Engineering (Eng), and Mathematics (Math).
Figure 4: Sources used by Gemini Deep Research when asked to work on LiDAR Scanning Mechanisms.
Figure 5: Sources used by Gemini Deep Research when asked to work on transfer learning.
