Multi-Agent Taskforce Collaboration: Self-Correction of Compounding Errors in Long-Form Literature Review Generation
Zhi Zhang, Yan Liu, Zhejing Hu, Gong Chen, Sheng-hua Zhong, Jiannong Cao
TL;DR
This work tackles compounding errors in automated long-form literature review generation by introducing the Multi-Agent Taskforce Collaboration (MATC) framework. MATC orchestrates a manager agent with three specialized taskforces (exploration for grounded outlining, exploitation for iterative fact location and drafting, and feedback for experience-based self-correction) to mitigate error propagation across the workflow. Across AutoSurvey, SurveyEval, and the new TopSurvey benchmark, MATC achieves state-of-the-art performance in both citation quality and content quality, and ablation studies confirm the critical role of each taskforce. Real-world deployment further demonstrates its practicality and efficiency, providing evidence of MATC’s robustness and scalability in producing large volumes of literature reviews.
Abstract
Compounding errors are a critical challenge in long-form literature review generation, where minor inaccuracies cascade and amplify across subsequent steps, severely compromising the faithfulness of the final output. To address this challenge, we propose the Multi-Agent Taskforce Collaboration (MATC) framework, which proactively mitigates errors by orchestrating LLM-based agents into three specialized taskforces: (1) an exploration taskforce that interleaves retrieval and outlining using a tree-based strategy to establish a grounded structure; (2) an exploitation taskforce that iteratively cycles between fact location and draft refinement to ensure evidential support; and (3) a feedback taskforce that leverages historical experience for self-correction before errors propagate. Experimental results show that MATC achieves state-of-the-art performance on the existing AutoSurvey and SurveyEval benchmarks, significantly outperforming strong baselines in both citation quality (e.g., +15.7% recall) and content quality. We further contribute TopSurvey, a new large-scale benchmark of 195 peer-reviewed survey topics, on which MATC maintains robust performance, demonstrating its generalizability.
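To make the compounding effect concrete, the following is a minimal toy model and not the MATC method itself. It assumes that each step of a generation workflow errs independently with a fixed probability and that a feedback check catches a fixed fraction of those errors before the next step consumes them. The function name `chain_faithfulness`, its parameters, and the numbers in the usage note are hypothetical and chosen purely for illustration; none of them come from the experiments.

```python
def chain_faithfulness(step_accuracy: float, n_steps: int, catch_rate: float = 0.0) -> float:
    """Probability that a chain of n_steps ends with no uncorrected error.

    Toy assumptions (not taken from the paper):
      - every step is correct with probability `step_accuracy`, independently;
      - a feedback check catches a fraction `catch_rate` of the errors made at
        a step before the next step builds on them.
    """
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be in [0, 1]")
    if not 0.0 <= catch_rate <= 1.0:
        raise ValueError("catch_rate must be in [0, 1]")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")

    # Error probability that survives the feedback check at a single step.
    residual_error = (1.0 - step_accuracy) * (1.0 - catch_rate)

    # With independent steps, the error-free probabilities multiply.
    return (1.0 - residual_error) ** n_steps


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    print(round(chain_faithfulness(0.95, 20), 3))        # no self-correction
    print(round(chain_faithfulness(0.95, 20, 0.8), 3))   # 80% of errors caught
```

Under these assumptions, a 20-step chain with 95% per-step accuracy is fully error-free only about 36% of the time, while catching 80% of errors at each step raises that to roughly 82%. This is the intuition behind correcting errors before they propagate.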
