SurveyLens: A Research Discipline-Aware Benchmark for Automatic Survey Generation
Beichen Guo, Zhiyuan Wen, Jia Gu, Senzhang Wang, Haochen Shi, Ruosong Yang, Shuaiqi Liu
TL;DR
SurveyLens introduces the first discipline-aware benchmark for Automatic Survey Generation (ASG), addressing the cross-disciplinary evaluation gaps caused by CS-centric baselines. It builds SurveyLens-1k, a dataset of 1,000 human-written surveys across 10 disciplines, and a dual evaluation framework that combines discipline-aware rubrics (scored by LLM judges with Bradley-Terry weights) with Canonical Alignment (RAMS and TAMS) to assess structural adherence and content grounding. Extensive experiments on 11 state-of-the-art ASG methods reveal a skeleton-vs-flesh trade-off: specialized ASG systems excel in structure, while Deep Research agents excel in content, and both cross-domain patterns and data-source quality play critical roles. The results offer actionable guidance for selecting tools suited to disciplinary needs and point toward optimizing ASG systems against discipline-specific evaluation criteria so that outputs align with human judgments. Throughout, rigorous prompts, SSR representations, and multi-modal content considerations underpin the cross-disciplinary assessment.
Abstract
The exponential growth of scientific literature has driven the evolution of Automatic Survey Generation (ASG) from simple pipelines to multi-agent frameworks and commercial Deep Research agents. However, current ASG evaluation methods rely on generic metrics and are heavily biased toward Computer Science (CS), so they fail to assess whether ASG methods adhere to the distinct standards of different academic disciplines. Consequently, researchers, especially those outside CS, lack clear guidance on using ASG systems to produce high-quality surveys that comply with the standards of their discipline. To bridge this gap, we introduce SurveyLens, the first discipline-aware benchmark for evaluating ASG methods across diverse research disciplines. We construct SurveyLens-1k, a curated dataset of 1,000 high-quality human-written surveys spanning 10 disciplines. We then propose a dual-lens evaluation framework: (1) Discipline-Aware Rubric Evaluation, which uses LLMs with human preference-aligned weights to assess adherence to domain-specific writing standards; and (2) Canonical Alignment Evaluation, which rigorously measures content coverage and synthesis quality against human-written survey papers. We conduct extensive experiments evaluating 11 state-of-the-art ASG methods on SurveyLens, including Vanilla LLMs, ASG systems, and Deep Research agents. Our analysis reveals the distinct strengths and weaknesses of each paradigm across fields, providing essential guidance for selecting tools tailored to specific disciplinary requirements.
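The TL;DR mentions Bradley-Terry weights for the rubric, and the abstract describes them as human preference-aligned. As a rough illustration of how such weights could be derived from pairwise preferences, the sketch below fits a standard Bradley-Terry model with the classic minorization-maximization update. This summary does not describe the paper's actual procedure, so the function name, the criterion names, and the input format (a list of winner/loser pairs) are all assumptions made for illustration.

```python
from collections import defaultdict


def bradley_terry_weights(comparisons, n_iter=200, tol=1e-9):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Hypothetical helper, not the paper's implementation. Assumes winner and
    loser differ in every pair. Returns a dict mapping each item to a
    strength, normalized so the strengths sum to 1.
    """
    wins = defaultdict(float)         # total wins per item
    pair_counts = defaultdict(float)  # comparisons per unordered pair
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        items.update((winner, loser))
    items = sorted(items)

    # Start from uniform strengths.
    strength = {i: 1.0 / len(items) for i in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            # MM update: p_i <- wins_i / sum_j n_ij / (p_i + p_j)
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0.0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = wins[i] / denom if denom else strength[i]
        total = sum(updated.values())
        updated = {i: v / total for i, v in updated.items()}
        delta = max(abs(updated[i] - strength[i]) for i in items)
        strength = updated
        if delta < tol:
            break
    return strength


# Hypothetical annotator preferences between two rubric criteria.
prefs = [("structure", "coverage")] * 3 + [("coverage", "structure")]
weights = bradley_terry_weights(prefs)
```

With only two criteria, the fitted strengths reduce to the observed win rates (here 3 of 4 comparisons favour "structure"). A weighted rubric score could then be formed as the sum of each criterion's score multiplied by its strength, though that aggregation step is likewise an assumption here.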
