SurveyLens: A Research Discipline-Aware Benchmark for Automatic Survey Generation

Beichen Guo, Zhiyuan Wen, Jia Gu, Senzhang Wang, Haochen Shi, Ruosong Yang, Shuaiqi Liu

TL;DR

SurveyLens introduces the first discipline-aware benchmark for Automatic Survey Generation (ASG), addressing cross-disciplinary evaluation gaps caused by CS-centric baselines. It builds SurveyLens-1k, a dataset of 1,000 human-written surveys across 10 disciplines, and a dual evaluation framework combining discipline-aware rubrics (via LLM judges with Bradley-Terry weights) with Canonical Alignment (RAMS and TAMS) to assess structural adherence and content grounding. The framework is validated through extensive experiments on 11 SOTA ASG methods, revealing a skeleton-vs-flesh trade-off: specialized ASG systems excel in structure, while Deep Research Agents excel in content, with cross-domain patterns and data-source quality playing critical roles. The results provide actionable guidance for selecting tools tailored to disciplinary needs and establish a pathway for optimizing ASG systems toward discipline-specific evaluation criteria, aligning outputs with human judgments. Throughout, rigorous prompts, SSR representations, and multi-modal content considerations underpin robust cross-disciplinary assessment.

Abstract

The exponential growth of scientific literature has driven the evolution of Automatic Survey Generation (ASG) from simple pipelines to multi-agent frameworks and commercial Deep Research agents. However, current ASG evaluation methods rely on generic metrics and are heavily biased toward Computer Science (CS), failing to assess whether ASG methods adhere to the distinct standards of various academic disciplines. Consequently, researchers, especially those outside CS, lack clear guidance on using ASG systems to yield high-quality surveys compliant with specific discipline standards. To bridge this gap, we introduce SurveyLens, the first discipline-aware benchmark evaluating ASG methods across diverse research disciplines. We construct SurveyLens-1k, a curated dataset of 1,000 high-quality human-written surveys spanning 10 disciplines. Subsequently, we propose a dual-lens evaluation framework: (1) Discipline-Aware Rubric Evaluation, which utilizes LLMs with human preference-aligned weights to assess adherence to domain-specific writing standards; and (2) Canonical Alignment Evaluation to rigorously measure content coverage and synthesis quality against human-written survey papers. We conduct extensive experiments by evaluating 11 state-of-the-art ASG methods on SurveyLens, including Vanilla LLMs, ASG systems, and Deep Research agents. Our analysis reveals the distinct strengths and weaknesses of each paradigm across fields, providing essential guidance for selecting tools tailored to specific disciplinary requirements.
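
To make the idea of "human preference-aligned weights" concrete, the following is a minimal sketch of how Bradley-Terry weights could be fitted from pairwise preferences and then used to combine per-aspect rubric scores. It is an illustration under stated assumptions, not the authors' implementation: the aspect names, the preference pairs, the function names `bradley_terry` and `weighted_rubric_score`, and the choice to compare rubric aspects against each other are all hypothetical, and the fitting procedure shown is the standard iterative (minorization-maximization) update for the Bradley-Terry model.

```python
from collections import defaultdict


def bradley_terry(pairs, n_iter=200, tol=1e-9):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Uses the classic iterative update p_i = W_i / sum_j n_ij / (p_i + p_j),
    renormalizing so the strengths sum to 1 after every pass.
    """
    items = sorted({x for pair in pairs for x in pair})
    wins = defaultdict(int)   # total wins per item
    games = defaultdict(int)  # number of comparisons per unordered pair
    for winner, loser in pairs:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1

    strength = {i: 1.0 / len(items) for i in items}
    for _ in range(n_iter):
        new = {}
        for i in items:
            denom = sum(
                games[frozenset((i, j))] / (strength[i] + strength[j])
                for j in items
                if j != i and frozenset((i, j)) in games
            )
            # An item that never wins ends up with strength 0.
            new[i] = wins[i] / denom if denom > 0 else strength[i]
        total = sum(new.values())
        new = {i: v / total for i, v in new.items()}
        delta = max(abs(new[i] - strength[i]) for i in items)
        strength = new
        if delta < tol:
            break
    return strength


def weighted_rubric_score(aspect_scores, weights):
    """Combine per-aspect judge scores using normalized weights."""
    return sum(weights[a] * s for a, s in aspect_scores.items())


# Hypothetical preference data: each pair means "annotators judged the
# first aspect more important than the second" for some discipline.
pairs = (
    [("coverage", "structure")] * 3 + [("structure", "coverage")] * 1
    + [("coverage", "citation")] * 2 + [("citation", "coverage")] * 1
    + [("structure", "citation")] * 2 + [("citation", "structure")] * 2
)
weights = bradley_terry(pairs)

# Hypothetical per-aspect scores from an LLM judge on a 1-5 scale.
scores = {"coverage": 4.0, "structure": 3.0, "citation": 5.0}
print(weights)
print(weighted_rubric_score(scores, weights))
```

Because the weights are normalized to sum to 1, the combined score stays on the same scale as the per-aspect scores, which keeps results comparable across disciplines that use different rubric weightings.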

Paper Structure (25 sections, 6 equations, 11 figures, 11 tables, 1 algorithm)

Figures (11)

  • Figure 1: The performance comparison among ASG methods (Vanilla LLMs, ASG systems, and Deep Research Agents) in generating surveys across different research disciplines.
  • Figure 2: Overview of the SurveyLens Framework.
  • Figure 3: Evaluation results of 11 ASG systems across four key dimensions (Overall, Content, Outline, and Reference), highlighting the trade-offs between structural planning and content synthesis.
  • Figure 4: The heatmap shows the detailed aspect-wise performance of evaluated ASG systems across Outline, Content, and References. Each cell represents the average score for a specific aspect; numeric details are given in the appendix on detailed results.
  • Figure 5: Prompt used to aggregate multiple evaluation aspects into a smaller set of universal, highly aggregated aspects. Used in the call_llm_for_aggregation function.
  • ...and 6 more figures