Multi$^2$: Multi-Agent Test-Time Scalable Framework for Multi-Document Processing

Juntai Cao, Xiang Zhang, Raymond Li, Chuyuan Li, Chenyu You, Shafiq Joty, Giuseppe Carenini

TL;DR

The paper tackles the challenge of multi-document summarization (MDS) by applying test-time scaling through a prompt-ensemble framework, Multi$^2$, to generate diverse candidate summaries and consolidate them with specialized aggregators. It introduces Consistency-Aware Preference (CAP) and LLM-ACU as evaluation metrics to mitigate positional bias and better capture summary completeness, addressing fundamental evaluation gaps in LLM-based summarization. Empirically, Multi$^2$ shows that prompt ensembles and context-preserving aggregation improve ROUGE, CAP, and LLM-ACU scores on MultiNews and OpenASP, while revealing scaling boundaries (around 5 samples) and model-size interactions. The work provides practical guidance for inference-time optimization in MDS and suggests avenues for dynamic prompt-search methods and extending evaluation metrics to reasoning tasks.

Abstract

Recent advances in test-time scaling have shown promising results in improving Large Language Model (LLM) performance through strategic computation allocation during inference. While this approach has demonstrated strong improvements in logical and mathematical reasoning tasks, its application to natural language generation (NLG), particularly summarization, remains unexplored. Multi-Document Summarization (MDS), a fundamental task in NLG, presents unique challenges by requiring models to extract and synthesize essential information across multiple lengthy documents. Unlike reasoning tasks, MDS demands a more nuanced approach to prompt design and ensemble methods, as no single "best" prompt can satisfy diverse summarization requirements. We propose a novel framework leveraging test-time scaling for MDS. Our approach employs prompt ensemble techniques to generate multiple candidate summaries using various prompts, then combines them with an aggregator to produce a refined summary. To evaluate our method effectively, we also introduce two new LLM-based metrics: the Consistency-Aware Preference (CAP) score and LLM Atom-Content-Unit (LLM-ACU) score, which assess summary quality while addressing the positional bias inherent in traditional automatic evaluation. Our extensive experiments demonstrate that this framework significantly enhances summary quality while also revealing the practical scaling boundaries to MDS tasks.
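
The generate-then-aggregate flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `LLM` callable, the function names, the prompt wording, and the `keep_sources` flag are all hypothetical, and the idea that an aggregator may or may not see the source documents is an assumption made only for this sketch.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: any callable that maps a prompt string to generated text.
LLM = Callable[[str], str]


def generate_candidates(llm: LLM, documents: Sequence[str],
                        prompt_bank: Sequence[str]) -> List[str]:
    """Produce one candidate summary per prompt in the bank."""
    joined = "\n\n".join(documents)
    return [llm(f"{prompt}\n\nDocuments:\n{joined}") for prompt in prompt_bank]


def aggregate(llm: LLM, candidates: Sequence[str],
              documents: Sequence[str] = ()) -> str:
    """Merge candidate summaries into one refined summary.

    If `documents` is non-empty the aggregator also sees the sources
    (an assumption of this sketch); otherwise it works from the
    candidates alone.
    """
    parts = ["Combine the candidate summaries below into one refined summary."]
    if documents:
        parts.append("Source documents:\n" + "\n\n".join(documents))
    numbered = [f"[{i}] {c}" for i, c in enumerate(candidates, start=1)]
    parts.append("Candidate summaries:\n" + "\n".join(numbered))
    return llm("\n\n".join(parts))


def summarize(llm: LLM, documents: Sequence[str],
              prompt_bank: Sequence[str], keep_sources: bool = True) -> str:
    """Run the ensemble: N prompts give N candidates, then one aggregation call."""
    candidates = generate_candidates(llm, documents, prompt_bank)
    return aggregate(llm, candidates, documents if keep_sources else ())
```

With a prompt bank of N prompts, this sketch makes N + 1 model calls per input, which is the inference-time cost that the number of samples scales.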

Paper Structure

This paper contains 29 sections, 10 equations, 5 figures, 17 tables.

Figures (5)

  • Figure 1: Visualization of our proposed Consistency-Aware Preference (CAP) score for text generation tasks. Drawing on LLMs' strong language understanding, CAP assigns a higher score to the summary that the LLM consistently ranks higher.
  • Figure 2: Overview of the Multi$^2$ inference-time scaling framework for summarization. Documents are first summarized by independent LLM agents, each guided by a different prompt from a curated prompt bank and constrained by user requirements. The resulting summaries are then processed by an aggregator (Voter, Context-Preserving Summarizer, or Context-Independent Summarizer) to generate the final consolidated summary.
  • Figure 3: Two-step calculation of CAP using an LLM. In this example, the LLM prefers the summary presented first in both step 1 and step 2, resulting in an inconsistent evaluation.
  • Figure 4: Scaling boundaries based on preference (CAP) scores. For MultiNews, CPS aggregation exhibits inverse scaling effects after 5 samples, yet through ensemble scaling, gpt-4o-mini achieves performance comparable to gpt-4o. In the OpenASP dataset, performance for both models plateaus after 5 samples.
  • Figure 5: Scaling boundaries based on completeness (LLM-ACU) scores. Across both datasets, CPS consistently outperforms CIS. The smaller model (gpt-4o-mini) demonstrates greater performance gains from ensemble scaling, with improvements continuing at higher sample counts compared to the larger model (gpt-4o), which reaches its scaling plateau earlier.
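
The two-step check in the Figure 3 caption can be illustrated with a small sketch. It assumes that step 2 repeats the comparison with the two summaries in swapped order, which is a reading of the caption rather than something stated in it. The `Judge` callable and the function name are hypothetical, and how consistent verdicts are turned into a numeric CAP score is not given here, so the sketch stops at the verdict.

```python
from typing import Callable, Optional

# Hypothetical judge: given (first_shown, second_shown), returns 1 or 2
# for the position of the summary it prefers.
Judge = Callable[[str, str], int]


def consistent_preference(judge: Judge, summary_a: str,
                          summary_b: str) -> Optional[str]:
    """Return "a" or "b" if the judge agrees with itself across both
    presentation orders, or None if the verdict flips with position."""
    step1 = judge(summary_a, summary_b)  # step 1: A is shown first
    step2 = judge(summary_b, summary_a)  # step 2 (assumed): order swapped
    a_wins_step1 = step1 == 1
    a_wins_step2 = step2 == 2
    if a_wins_step1 and a_wins_step2:
        return "a"
    if not a_wins_step1 and not a_wins_step2:
        return "b"
    return None  # preferred position stayed fixed, so the content verdict flipped
```

A judge that always picks position 1, as in the Figure 3 example, yields `None` under this check, so its positional bias does not count as a preference for either summary.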