What's Wrong? Refining Meeting Summaries with LLM Feedback

Frederic Kirstein, Terry Ruas, Bela Gipp

TL;DR

This work proposes a two-stage, multi-LLM refinement framework for meeting summarization: Mistake Identification locates errors in an existing summary, and Summary Refinement then produces a higher-quality output. It introduces QMSum Mistake, a dataset of 200 summaries annotated by humans for nine error types, to study post-hoc correction, and demonstrates that multi-instance, CoT-guided error identification coupled with targeted feedback significantly improves relevance, informativeness, conciseness, and coherence, approaching gold-standard summaries. Through extensive experiments on prompting strategies (Direct vs CoT) and feedback transfer (Direct vs Consolidation), the authors show that CoT explanations used as feedback with direct transfer generally yield the strongest quality gains, while consolidation is less effective for long transcripts. The work highlights practical implications for robust, scalable post-hoc refinement of meeting summaries and offers a foundation for applying multi-LLM feedback loops to other complex text-generation tasks, supplemented by the release of the QMSum Mistake dataset.

Abstract

Meeting summarization has become a critical task since digital encounters have become a common practice. Large language models (LLMs) show great potential in summarization, offering enhanced coherence and context understanding compared to traditional methods. However, they still struggle to maintain relevance and avoid hallucination. We introduce a multi-LLM correction approach for meeting summarization using a two-phase process that mimics the human review process: mistake identification and summary refinement. We release QMSum Mistake, a dataset of 200 automatically generated meeting summaries annotated by humans on nine error types, including structural, omission, and irrelevance errors. Our experiments show that these errors can be identified with high accuracy by an LLM. We transform identified mistakes into actionable feedback to improve the quality of a given summary measured by relevance, informativeness, conciseness, and coherence. This post-hoc refinement effectively improves summary quality by leveraging multiple LLMs to validate output quality. Our multi-LLM approach for meeting summarization shows potential for similar complex text generation tasks requiring robustness, action planning, and discussion towards a goal.
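
The two-phase process described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prompt wording, the function names (`identify_mistakes`, `refine_summary`), and the `toy_llm` stand-in are all assumptions made for illustration, and only three of the nine error types are listed because those are the only ones named here. Any real LLM client could be passed in place of `toy_llm`, as long as it maps a prompt string to a completion string.

```python
from typing import Callable, Dict, List

# Only the three error types named in the abstract; the full set of nine is not listed here.
ERROR_TYPES: List[str] = ["structure", "omission", "irrelevance"]

# Assumption: an LLM is modeled as a plain function from prompt text to completion text.
LLM = Callable[[str], str]


def identify_mistakes(transcript: str, summary: str, llm: LLM,
                      error_types: List[str] = ERROR_TYPES) -> Dict[str, str]:
    """Phase 1: ask about each error type in its own call (multi-instance setup)."""
    feedback: Dict[str, str] = {}
    for error_type in error_types:
        prompt = (
            f"CHECK error type: {error_type}\n"
            f"Transcript:\n{transcript}\n"
            f"Summary:\n{summary}\n"
            "Answer 'Yes' or 'No', then explain your reasoning."
        )
        answer = llm(prompt).strip()
        # Keep the explanation only when the model reports a mistake.
        if answer.lower().startswith("yes"):
            feedback[error_type] = answer
    return feedback


def refine_summary(transcript: str, summary: str,
                   feedback: Dict[str, str], llm: LLM) -> str:
    """Phase 2: pass the identified mistakes directly to a refinement call."""
    if not feedback:
        return summary  # nothing flagged, so the summary is left unchanged
    notes = "\n".join(f"- {kind}: {text}" for kind, text in feedback.items())
    prompt = (
        "REFINE the summary using the feedback.\n"
        f"Transcript:\n{transcript}\n"
        f"Summary:\n{summary}\n"
        f"Feedback:\n{notes}"
    )
    return llm(prompt).strip()


def toy_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model, used only to make the sketch runnable."""
    first_line = prompt.split("\n")[0]
    if first_line.startswith("REFINE"):
        return "Refined summary."
    if "omission" in first_line:
        return "Yes. The budget decision is missing."
    return "No."


if __name__ == "__main__":
    found = identify_mistakes("meeting transcript", "draft summary", toy_llm)
    print(found)
    print(refine_summary("meeting transcript", "draft summary", found, toy_llm))
```

Keeping identification and refinement as separate calls mirrors the review-then-edit split in the text, and it allows a different model to be used for each phase.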

Paper Structure (27 sections, 9 figures, 9 tables)

Figures (9)

  • Figure 1: Overview of the two-stage refinement protocol displaying the assessed variants. The Mistake Identification block is analyzed in the Mistake Identification section and the Refinement block in the Summary Refinement section.
  • Figure 2: Average mistake identification accuracy, false positive rate, and false negative rate for each MIP variant. For accuracy, higher is better; for the false positive and false negative rates, lower is better.
  • Figure 3: Ranking of multiple summaries refined for up to 10 rounds. The red dotted line indicates the ranking of the GOLD summaries.
  • Figure 4: MIP prompt template in the format for multi-instance usage. In the single-instance setup, the definition and example blocks are repeated for every error type.
  • Figure 5: Prompt template used to consolidate feedback for the consolidation TP. The model is tasked with extracting from the exhaustive feedback what the refinement model should consider when editing.
  • ...and 4 more figures