SurGE: A Benchmark and Evaluation Framework for Scientific Survey Generation

Weihang Su, Anzhe Xie, Qingyao Ai, Jianming Long, Jiaxin Mao, Ziyi Ye, Yiqun Liu

TL;DR

SurGE provides a reproducible benchmark for automated scientific survey generation by pairing a large arXiv-based corpus with 205 expert-written ground-truth surveys and a fully automated, multi-dimensional evaluation framework (comprehensiveness, citation accuracy, structural quality, content quality). The task has two stages: retrieve relevant papers, then generate a structured survey with accurate citations. This setup enables standardized comparisons across retrieval and generation methods. Experimental results show a substantial gap between state-of-the-art retrieval/generation pipelines and expert surveys, highlighting issues such as incomplete topic coverage and citation hallucinations, and revealing a trade-off between local structural accuracy and global topic coverage. By making code, data, and models openly available, SurGE aims to catalyze advances at the intersection of information retrieval and long-form scientific writing.

Abstract

The rapid growth of academic literature makes the manual creation of scientific surveys increasingly infeasible. While large language models show promise for automating this process, progress in this area is hindered by the absence of standardized benchmarks and evaluation protocols. To bridge this critical gap, we introduce SurGE (Survey Generation Evaluation), a new benchmark for scientific survey generation in computer science. SurGE consists of (1) a collection of test instances, each including a topic description, an expert-written survey, and its full set of cited references, and (2) a large-scale academic corpus of over one million papers. In addition, we propose an automated evaluation framework that measures the quality of generated surveys across four dimensions: comprehensiveness, citation accuracy, structural organization, and content quality. Our evaluation of diverse LLM-based methods demonstrates a significant performance gap, revealing that even advanced agentic frameworks struggle with the complexities of survey generation and highlighting the need for future research in this area. We have open-sourced all the code, data, and models at: https://github.com/oneal2000/SurGE
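
To make the shape of a test instance concrete, the following is a minimal sketch of how one might represent it and compute two simple reference-level checks. It is an illustration only: the class, field, and function names are hypothetical and are not taken from the SurGE release, and the two checks (recall of the expert survey's references, and citations that do not resolve to any corpus paper) are simplified stand-ins, not the benchmark's actual definitions of comprehensiveness or citation accuracy.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class TestInstance:
    """One benchmark item: a topic, the expert survey, and the papers it cites.

    Field names are hypothetical and chosen for illustration only.
    """
    topic: str
    expert_survey: str
    reference_ids: frozenset[str] = field(default_factory=frozenset)


def reference_recall(cited_ids: set[str], instance: TestInstance) -> float:
    """Fraction of the expert survey's references that a generated survey also cites."""
    if not instance.reference_ids:
        return 0.0  # avoid division by zero when no ground-truth references exist
    overlap = cited_ids & instance.reference_ids
    return len(overlap) / len(instance.reference_ids)


def unknown_citations(cited_ids: set[str], corpus_ids: set[str]) -> set[str]:
    """Cited identifiers that match no paper in the corpus (candidate hallucinations)."""
    return cited_ids - corpus_ids
```

A generated survey would be scored by extracting the identifiers it cites and passing them to these functions along with the matching instance and the set of corpus identifiers.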

Paper Structure

This paper contains 37 sections, 9 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Overview of the SurGE Benchmark. (a) Summary statistics of the curated survey dataset and its associated retrieval corpus. (b) Metadata of the pre-processed survey dataset used in SurGE.