Learning to Summarize by Learning to Quiz: Adversarial Agentic Collaboration for Long Document Summarization

Weixuan Wang, Minghao Wu, Barry Haddow, Alexandra Birch

TL;DR

SummQ introduces an adversarial multi-agent framework for long-document summarization that couples summarization and quizzing tasks. It deploys four agent groups (Summary Generators, Quiz Generators, Summary Reviewers, and Quiz Reviewers) plus an Examinee that checks whether the quiz questions can be answered from the summary, enabling iterative refinement across rounds. The approach uses a four-phase generator process (independent drafting, aggregation, best draft selection, collective voting) and a four-phase reviewer process (independent reviewing, issue categorization, contested issue debate, final decision) to produce high-quality, verifiable summaries. Empirical results on MENSA, BookSum, and GovReport show state-of-the-art performance across ROUGE, BERTScore, LLM-as-a-Judge, and human evaluations, with SummQcombo generally outperforming SummQsolo and baselines; analyses reveal how iteration count, agent count, and backbone quality shape performance and cost. The work demonstrates that adversarial agentic collaboration with quiz-based quality checks can significantly improve the quality, coverage, and verifiability of long-document summaries, offering a scalable blueprint for robust abstractive summarization.
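
The four-phase generator process named above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the callables `drafters`, `aggregate`, and `vote` are hypothetical stand-ins for LLM agents, and the assumption that the vote runs over the individual drafts plus the aggregated draft, with a simple majority deciding, is ours rather than the paper's.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical agent interfaces; in practice each would wrap an LLM call.
Drafter = Callable[[str], str]
Aggregator = Callable[[List[str]], str]
Voter = Callable[[int, List[str]], int]  # (agent index, candidates) -> chosen index


def generate_with_voting(
    document: str,
    drafters: List[Drafter],
    aggregate: Aggregator,
    vote: Voter,
) -> str:
    # Phase 1: each generator drafts independently.
    drafts = [draft(document) for draft in drafters]
    # Phase 2: the drafts are merged into one aggregated candidate.
    merged = aggregate(drafts)
    # Phase 3 (assumed): every agent picks the candidate it considers best.
    candidates = drafts + [merged]
    ballots = [vote(i, candidates) for i in range(len(drafters))]
    # Phase 4 (assumed): a simple majority decides; ties go to the
    # candidate that received its first ballot earliest.
    winner, _ = Counter(ballots).most_common(1)[0]
    return candidates[winner]
```

With toy agents that always prefer the longest candidate, the aggregated draft wins the vote. Real agents would need a prompt-based selection step in place of `vote`.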

Abstract

Long document summarization remains a significant challenge for current large language models (LLMs), as existing approaches commonly struggle with information loss, factual inconsistencies, and coherence issues when processing excessively long documents. We propose SummQ, a novel adversarial multi-agent framework that addresses these limitations through collaborative intelligence between specialized agents operating in two complementary domains: summarization and quizzing. Our approach employs summary generators and reviewers that work collaboratively to create and evaluate comprehensive summaries, while quiz generators and reviewers create comprehension questions that serve as continuous quality checks for the summarization process. This adversarial dynamic, enhanced by an examinee agent that validates whether the generated summary contains the information needed to answer the quiz questions, enables iterative refinement through multifaceted feedback mechanisms. We evaluate SummQ on three widely used long document summarization benchmarks. Experimental results demonstrate that our framework significantly outperforms existing state-of-the-art methods across ROUGE and BERTScore metrics, as well as in LLM-as-a-Judge and human evaluations. Our comprehensive analyses reveal the effectiveness of the multi-agent collaboration dynamics, the influence of different agent configurations, and the impact of the quizzing mechanism. This work establishes a new approach for long document summarization that uses adversarial agentic collaboration to improve summarization quality.
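
The quiz-driven refinement described in the abstract can be sketched as a small loop. This is a simplified illustration under stated assumptions, not the paper's algorithm: the function names `summarize`, `make_quiz`, and `examinee_can_answer` are hypothetical, the reviewer agents are omitted, and we assume the quiz is generated once from the source document and that unanswered questions are passed back to the summarizer as feedback.

```python
from typing import Callable, List, Tuple


def refine_summary(
    document: str,
    summarize: Callable[[str, List[str]], str],       # (document, feedback) -> summary
    make_quiz: Callable[[str], List[str]],            # document -> quiz questions
    examinee_can_answer: Callable[[str, str], bool],  # (summary, question) -> bool
    max_iters: int = 3,
) -> Tuple[str, List[str]]:
    # Assumption: questions are written against the full document.
    quiz = make_quiz(document)
    feedback: List[str] = []
    summary = ""
    for _ in range(max_iters):
        # Regenerate the summary, conditioning on questions it failed so far.
        summary = summarize(document, feedback)
        # The examinee sees only the summary and flags unanswerable questions.
        feedback = [q for q in quiz if not examinee_can_answer(summary, q)]
        if not feedback:
            break  # every quiz question is answerable from the summary
    return summary, feedback
```

The loop returns both the final summary and any questions that remain unanswerable, so a caller can tell whether refinement converged or simply ran out of iterations.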

Paper Structure

This paper contains 52 sections, 9 figures, 7 tables, 3 algorithms.

Figures (9)

  • Figure 1: The overall framework of SummQ. It consists of two tasks, summarization and quizzing, and two types of agents: generator and reviewer, resulting in four groups of agents: Summary Generators, Quiz Generators, Summary Reviewers, and Quiz Reviewers. Additionally, we include an Examinee agent to check if quiz questions can be answered by the summary.
  • Figure 2: Comparison between SummQcombo and baselines across iterations, judged by GPT-5 on MENSA, with three GPT-4o agents in each component of SummQcombo.
  • Figure 3: Human evaluation results comparing GPT-4o, o3, SummQcombo, and SummQcomboR.
  • Figure 4: Results on MENSA obtained by SummQ, where one component contains multiple agents while other components contain only a single agent.
  • Figure 5: Results with different numbers of iterations $T_{\text{iter}}$ on MENSA with SummQcombo. All agents are GPT-4o.
  • ...and 4 more figures