ACADREASON: Exploring the Limits of Reasoning Models with Academic Research Problems

Xin Gui, King Zhu, JinCheng Ren, Qianben Chen, Zekun Moore Wang, Yizhi LI, Xinpeng Liu, Xiaowan Li, Wenli Ren, Linyu Miao, Tianrui Qin, Ziqi Shu, He Zhu, Xiangru Tang, Dingfeng Shi, Jiaheng Liu, Yuchen Eleanor Jiang, Minghao Liu, Ge Zhang, Wangchunshu Zhou

TL;DR

AcadReason addresses the need for a rigorous, cross-domain benchmark of high-level reasoning over academic content. Its 50 questions are drawn from 430 top-tier papers across computer science, economics, law, mathematics, and philosophy, and each comes with a golden answer, hints, and an adaptive checklist. Evaluation of more than 10 LLMs and agent frameworks reveals a substantial gap between reasoning models and agents: GPT-5 achieves a 16% pass rate and a checklist score of 40.6, while agents reach up to 34% and 65.1, respectively. The work shows that structured hints and agent-based knowledge retrieval meaningfully improve performance, and it releases the annotated data to spur further advances in academic reasoning systems.

Abstract

In recent years, the research focus of large language models (LLMs) and agents has shifted increasingly from demonstrating novel capabilities to complex reasoning and tackling challenging tasks. However, existing evaluations focus mainly on math/code contests or general tasks, while existing multi-domain academic benchmarks lack sufficient reasoning depth, leaving the field without a rigorous benchmark for high-level reasoning. To fill this gap, we introduce the Acadreason benchmark, designed to evaluate the ability of LLMs and agents to acquire and reason over academic knowledge. It consists of 50 expert-annotated academic problems across five reasoning-intensive domains: computer science, economics, law, mathematics, and philosophy. All questions are sourced from top-tier publications in recent years and undergo rigorous annotation and quality control to ensure they are both challenging and answerable. We conduct systematic evaluations of over 10 mainstream LLMs and agents. The results show that most LLMs score below 20 points, with even the cutting-edge GPT-5 achieving only 16 points. While agents achieve higher scores, none exceeds 40 points. This demonstrates the current capability gap between LLMs and agents in super-intelligent academic research tasks and highlights the challenges of Acadreason.

Paper Structure

This paper contains 33 sections, 16 figures, 6 tables.

Figures (16)

  • Figure 1: Overview of the AcadReason benchmark construction and evaluation pipeline. It consists of three stages: (1) High-Quality Academic Papers Collection – experts filter 430 papers across 5 domains into 50 top-tier theoretical works; (2) High-Reasoning Research Question Extraction – research questions are refined into formal queries with golden answers containing sufficient reasoning; (3) Checklists and Hints Extraction – background, definition, and methodology hints are provided together with verifiable, independent checklists. For evaluation, candidate responses are compared against golden answers and checklists, and GPT-5 mini assigns final scores.
  • Figure 2: Overall performance across domains, measured by Checklist Score.
  • Figure 3: Ablation study results. (a) shows the performance gain per model, while (b) presents the average gain across disciplines.
  • Figure 4: Side-by-side comparison of OAgents and GPT-5 on the legal reasoning task.
  • Figure 5: Category Distribution
  • ...and 11 more figures
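
The evaluation step described in the Figure 1 caption (responses compared against golden answers and checklists, then scored by a judge model) can be illustrated with a minimal sketch of how the two reported metrics might be aggregated. This is an assumption about the aggregation, not the authors' code: the `JudgedResponse` type, its field names, and the choice to pool checklist items over all questions are hypothetical, and the judge's verdicts are taken as given inputs.

```python
from dataclasses import dataclass


@dataclass
class JudgedResponse:
    """One judged answer. All field names are hypothetical."""
    answer_matches_golden: bool  # judge's verdict against the golden answer
    checklist_hits: list[bool]   # one verdict per checklist item


def pass_rate(results: list[JudgedResponse]) -> float:
    """Percentage of questions whose final answer the judge accepted."""
    if not results:
        return 0.0
    passed = sum(r.answer_matches_golden for r in results)
    return 100.0 * passed / len(results)


def checklist_score(results: list[JudgedResponse]) -> float:
    """Percentage of checklist items satisfied, pooled over all questions.

    Pooling is an assumption; averaging per-question ratios is another option.
    """
    total = sum(len(r.checklist_hits) for r in results)
    if total == 0:
        return 0.0
    hit = sum(sum(r.checklist_hits) for r in results)
    return 100.0 * hit / total
```

On a 50-question benchmark, a 16% pass rate corresponds to 8 accepted answers and 34% to 17. Because the checklist score gives partial credit for intermediate reasoning, it can sit well above the pass rate, as in the 40.6 versus 16 reported for GPT-5.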