Multi$^2$: Multi-Agent Test-Time Scalable Framework for Multi-Document Processing
Juntai Cao, Xiang Zhang, Raymond Li, Chuyuan Li, Chenyu You, Shafiq Joty, Giuseppe Carenini
TL;DR
The paper tackles the challenge of multi-document summarization (MDS) by applying test-time scaling through a prompt-ensemble framework, Multi$^2$, to generate diverse candidate summaries and consolidate them with specialized aggregators. It introduces Consistency-Aware Preference (CAP) and LLM-ACU as evaluation metrics to mitigate positional bias and better capture summary completeness, addressing fundamental evaluation gaps in LLM-based summarization. Empirically, Multi$^2$ shows that prompt ensembles and context-preserving aggregation improve ROUGE, CAP, and LLM-ACU scores on MultiNews and OpenASP, while revealing scaling boundaries (around 5 samples) and model-size interactions. The work provides practical guidance for inference-time optimization in MDS and suggests avenues for dynamic prompt-search methods and extending evaluation metrics to reasoning tasks.
Abstract
Recent advances in test-time scaling have shown promising results in improving Large Language Model (LLM) performance through strategic computation allocation during inference. While this approach has demonstrated strong improvements in logical and mathematical reasoning tasks, its application to natural language generation (NLG), particularly summarization, remains unexplored. Multi-Document Summarization (MDS), a fundamental task in NLG, presents unique challenges because it requires models to extract and synthesize essential information across multiple lengthy documents. Unlike reasoning tasks, MDS demands a more nuanced approach to prompt design and ensemble methods, as no single "best" prompt can satisfy diverse summarization requirements. We propose a novel framework that leverages test-time scaling for MDS. Our approach uses prompt ensemble techniques to generate multiple candidate summaries from different prompts, then combines them with an aggregator to produce a refined summary. To evaluate our method effectively, we also introduce two new LLM-based metrics: the Consistency-Aware Preference (CAP) score and the LLM Atom-Content-Unit (LLM-ACU) score, which assess summary quality while addressing the positional bias inherent in traditional automatic evaluation. Our extensive experiments demonstrate that this framework significantly enhances summary quality while also revealing the practical boundaries of test-time scaling for MDS tasks.
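The generate-then-aggregate flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `ensemble_summarize`, the `generate` callable (a stand-in for any LLM call), the prompt wording, and the `max_candidates` cap are all hypothetical names and choices made for illustration. Passing the source documents to the aggregator is likewise an assumption about what context-preserving aggregation might look like.

```python
from typing import Callable, List

# Hypothetical type: any function that maps a prompt string to generated text.
Generate = Callable[[str], str]


def ensemble_summarize(
    documents: List[str],
    prompts: List[str],
    generate: Generate,
    max_candidates: int = 5,  # illustrative cap, not a value prescribed by the paper
) -> str:
    """Generate one candidate summary per prompt, then aggregate them."""
    joined = "\n\n".join(documents)

    # Step 1: prompt ensemble, one candidate summary per prompt.
    candidates = [
        generate(f"{prompt}\n\n{joined}") for prompt in prompts[:max_candidates]
    ]

    # Step 2: aggregation. The aggregator sees the candidates and, as an
    # assumption of this sketch, the source documents as well.
    numbered = "\n\n".join(
        f"Candidate {i + 1}:\n{c}" for i, c in enumerate(candidates)
    )
    aggregator_prompt = (
        "Combine the candidate summaries below into one refined summary.\n\n"
        f"Source documents:\n{joined}\n\n{numbered}"
    )
    return generate(aggregator_prompt)
```

With three prompts, this sketch makes four calls to `generate`: three to produce candidates and one to aggregate them, so inference cost grows linearly with the number of prompts.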
