Multi-Agent Taskforce Collaboration: Self-Correction of Compounding Errors in Long-Form Literature Review Generation

Zhi Zhang, Yan Liu, Zhejing Hu, Gong Chen, Sheng-hua Zhong, Jiannong Cao

TL;DR

This work tackles compounding errors in automated long-form literature review generation by introducing the Multi-Agent Taskforce Collaboration (MATC) framework. MATC coordinates a manager agent with three specialized taskforces to limit error propagation across the workflow: exploration for grounded outlining, exploitation for iterative fact location and drafting, and feedback for experience-based self-correction. On AutoSurvey, SurveyEval, and the new TopSurvey benchmark, MATC achieves state-of-the-art performance in both citation quality and content quality, and ablation studies confirm the critical role of each taskforce. A real-world deployment further demonstrates that MATC is practical and efficient, and that it stays robust and scalable when producing large volumes of literature reviews.

Abstract

Compounding error is a critical problem in long-form literature review generation, where minor inaccuracies cascade and amplify across subsequent steps, severely compromising the faithfulness of the final output. To address this challenge, we propose the Multi-Agent Taskforce Collaboration (MATC) framework, which proactively mitigates errors by orchestrating LLM-based agents into three specialized taskforces: (1) an exploration taskforce that interleaves retrieval and outlining using a tree-based strategy to establish a grounded structure; (2) an exploitation taskforce that iteratively cycles between fact location and draft refinement to ensure evidential support; and (3) a feedback taskforce that leverages historical experience for self-correction before errors propagate. Experimental results show that MATC achieves state-of-the-art performance on existing benchmarks (AutoSurvey and SurveyEval), significantly outperforming strong baselines in both citation quality (e.g., +15.7% recall) and content quality. We further contribute TopSurvey, a new large-scale benchmark of 195 peer-reviewed survey topics, on which MATC maintains robust performance, demonstrating its generalizability.
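
The division of labor among the three taskforces can be illustrated with a toy control loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the real agents are LLM-based, whereas here a hard-coded `CORPUS` dictionary stands in for retrieval, and every name (`Section`, `explore`, `exploit`, `feedback`, `manager`, the `has_citation` check) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Toy stand-in for a retriever (assumption): heading -> (evidence snippets, child headings).
CORPUS = {
    "LLM agents": (["Agents plan with tools [1]"], ["Planning", "Memory"]),
    "Planning": (["Tree search improves planning [2]"], []),
    "Memory": ([], []),  # nothing retrieved: an ungrounded heading
}


@dataclass
class Section:
    title: str
    evidence: list
    draft: str = ""


def explore(topic, max_depth=2):
    """Exploration sketch: grow the outline as a tree, retrieving at every node."""
    outline, frontier = [], [(topic, 0)]
    while frontier:
        heading, depth = frontier.pop(0)
        evidence, children = CORPUS.get(heading, ([], []))
        if not evidence:
            continue  # drop ungrounded headings before they reach drafting
        outline.append(Section(heading, list(evidence)))
        if depth < max_depth:
            frontier.extend((child, depth + 1) for child in children)
    return outline


def exploit(section, max_rounds=3):
    """Exploitation sketch: alternate fact location and draft refinement."""
    for _ in range(max_rounds):
        # Fact location: which retrieved facts are not yet in the draft?
        missing = [fact for fact in section.evidence if fact not in section.draft]
        if not missing:
            break
        # Draft refinement: fold one located fact into the draft per round.
        section.draft = (section.draft + " " + missing[0]).strip()
    return section


def feedback(section, experience):
    """Feedback sketch: run checks remembered from earlier runs, return failures."""
    return [name for name, check in experience.items() if not check(section)]


def manager(topic):
    """Manager sketch: route each section through the three taskforces in order."""
    experience = {"has_citation": lambda s: "[" in s.draft}  # hypothetical check
    review = []
    for section in explore(topic):
        section = exploit(section)
        if feedback(section, experience):
            continue  # a real system would revise; this sketch withholds the section
        review.append(section)
    return review


if __name__ == "__main__":
    for sec in manager("LLM agents"):
        print(sec.title, "->", sec.draft)
```

In this sketch the "Memory" heading never reaches drafting because retrieval returned no evidence for it, which is the kind of early interception the abstract describes: an error is stopped at the step where it arises instead of being carried into later steps.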

Paper Structure

This paper contains 14 sections, 14 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Illustration of the error accumulation phenomenon.
  • Figure 2: The overall framework of Multi-Agent Taskforce Collaboration (MATC) for literature review generation.
  • Figure 3: Comparison of automatic survey generation methods on the SurveyEval benchmark. Higher scores indicate better performance.