Super Research: Answering Highly Complex Questions with Large Language Models through Super Deep and Super Wide Research

Yubo Dong, Nianhao You, Yuxuan Hou, Zixun Sun, Yue Zhang, Liang Zhang, Siyuan Zhao, Hehe Fan

TL;DR

This work introduces Super Research, a task for complex autonomous research that integrates structured decomposition into a research plan, super wide retrieval for diverse perspectives, and super deep investigation to resolve uncertainties through iterative queries.

Abstract

While Large Language Models (LLMs) have demonstrated proficiency in Deep Research or Wide Search, their capacity to solve highly complex questions, namely those requiring long-horizon planning, massive evidence gathering, and synthesis across heterogeneous sources, remains largely unexplored. We introduce Super Research, a task for complex autonomous research that integrates (i) structured decomposition into a research plan, (ii) super wide retrieval for diverse perspectives, and (iii) super deep investigation to resolve uncertainties through iterative queries. To evaluate this capability, we curated a benchmark of 300 expert-written questions across diverse domains, each requiring up to 100+ retrieval steps and 1,000+ web pages to reconcile conflicting evidence. Super Research produces verifiable reports with fine-grained citations and intermediate artifacts (e.g., outlines and tables) to ensure traceable reasoning. Furthermore, we present a graph-anchored auditing protocol that evaluates Super Research along five dimensions: Coverage, Logical Consistency, Report Utility, Objectivity, and Citation Health. While super-complex questions may be infrequent in standard applications, Super Research serves as a critical ceiling evaluation and stress test for LLM capabilities. A model's proficiency within Super Research acts as a powerful proxy for its general research competence; success here suggests the robustness necessary to navigate nearly any subordinate research task. The leaderboard is available at: https://cnsdqd-dyb.github.io/Super-Research-Benchmark/

Paper Structure

This paper contains 61 sections, 10 equations, 16 figures, and 4 tables.

Figures (16)

  • Figure 1: Comparison of Super Research with standard Retrieval-Augmented Generation (RAG), Deep Research, and Wide Search. (a) RAG represents the baseline paradigm, which typically operates with limited depth and width. (b) Deep Research focuses on vertical exploration and recursive chains of evidence to resolve nuanced questions, though it often lacks horizontal width, leading to "tunnel vision". (c) Wide Search prioritizes horizontal data acquisition and large-scale coverage across diverse information nodes, but lacks synthetic depth, which can result in "information overload". (d) Super Research explicitly couples Super Deep investigation with Super Wide retrieval to address highly complex questions requiring long-horizon planning, 100+ retrieval steps, and the synthesis of 1,000+ web pages. It can generate research reports of up to 50 pages, averaging around 100k words per report.
  • Figure 2: Overview of the SuperResearch Benchmark framework. (a) Construction Pipeline: The process starts with the joint definition of 300+ "super hard" open-ended tasks, which undergo rigorous expert vetting. Autonomous agents then execute a long-horizon research process involving 100+ retrieval steps and the synthesis of 1,000+ web pages. The resulting "Gold Standard" consists of a structured Research Graph, canonical reports, and a question-answer (QA) exam. (b) Evaluation Flow: Research reports are audited via Research Graph Projection. The system maps claims from the generated report onto the ground-truth Research Graph to verify Nodes Recall (categorized into atomic facts and insights) and the integrity of Logical Connections, ensuring high-level conclusions are grounded in verifiable evidence. (c) Metrics Suite: A comprehensive five-dimensional suite quantifies model performance, including Coverage & Comprehension ($\mathcal{R}_\text{weighted}$), Logical Consistency ($\mathcal{C}_\text{logic}$), Report Utility ($\mathcal{U}_\text{qa}$), Objectivity Score ($\mathcal{O}_\text{bias}$), and Citation Health.
  • Figure 3: Structural and Functional Characterization of the SuperResearch Benchmark. (a) Quantitative Distribution of Tasks by Domain: A Rose Chart illustrating the distribution of 300 expert-written tasks across 10 specialized domains. (b) Core Benchmark Metrics & Scale: Quantitative statistics characterizing the "ceiling-level" challenge across four key dimensions: Complexity Metrics (measuring reasoning depth and retrieval breadth), Report Statistics (tracking content volume and structure), Graph Composition (quantifying hierarchical knowledge density), and Evaluation Questions (showing the diversity of audit mechanisms). (c) Example Tasks from SuperResearch Benchmark: Representative inquiries exhibiting multi-objective trade-offs and conflicting evidence, serving as a "ceiling-level" challenge.
  • Figure 4: Evaluation Sensitivity Analysis. Compared to the LLM Judge (grey), our Graph Metric (dark blue) shows superior responsiveness to quality fluctuations.
  • Figure 5: Detailed Task Distribution Across Subdomains. Supplementing the main text, this figure visualizes the granular distribution of tasks within each specific subdomain. This breakdown highlights the diversity of the SuperResearch benchmark, confirming its coverage across a wide spectrum of specialized professional fields.
  • ...and 11 more figures
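
The Research Graph Projection described in the Figure 2 caption can be illustrated with a small sketch. This is not the paper's implementation: the exact definitions of $\mathcal{R}_\text{weighted}$ and $\mathcal{C}_\text{logic}$ are not given in this listing, so the node weights, the names `Node`, `weighted_node_recall`, and `edge_recall`, and the rule that an edge counts only when both endpoints are recalled are all assumptions made for illustration. The sketch also assumes that the mapping from report claims to gold nodes has already been done elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A gold-standard graph node (hypothetical structure)."""
    node_id: str
    kind: str  # "fact" (atomic fact) or "insight"


def weighted_node_recall(gold_nodes, matched_ids, weights):
    """Weighted share of gold nodes that the report recalled.

    `weights` maps a node kind to its weight; the values are assumed,
    not taken from the paper.
    """
    total = sum(weights[n.kind] for n in gold_nodes)
    if total == 0:
        return 0.0
    hit = sum(weights[n.kind] for n in gold_nodes if n.node_id in matched_ids)
    return hit / total


def edge_recall(gold_edges, matched_ids):
    """Share of gold edges whose two endpoints were both recalled.

    A simple stand-in for checking the integrity of logical connections.
    """
    if not gold_edges:
        return 0.0
    kept = sum(1 for src, dst in gold_edges
               if src in matched_ids and dst in matched_ids)
    return kept / len(gold_edges)


# Toy gold graph: three atomic facts supporting one insight.
gold_nodes = [Node("f1", "fact"), Node("f2", "fact"),
              Node("f3", "fact"), Node("i1", "insight")]
gold_edges = [("f1", "i1"), ("f2", "i1"), ("f3", "i1")]

# Assumed weighting: an insight counts twice as much as an atomic fact.
weights = {"fact": 1.0, "insight": 2.0}

# Gold nodes that the report's claims were mapped onto (assumed given).
matched_ids = {"f1", "f2", "i1"}

node_score = weighted_node_recall(gold_nodes, matched_ids, weights)
edge_score = edge_recall(gold_edges, matched_ids)
print(node_score, edge_score)
```

In this toy case the report misses one supporting fact, so both scores fall below 1. Weighting insights more heavily than atomic facts is one plausible reading of a "weighted" coverage score; other weightings fit the caption equally well.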