What's Wrong? Refining Meeting Summaries with LLM Feedback
Frederic Kirstein, Terry Ruas, Bela Gipp
TL;DR
This work proposes a two-stage, multi-LLM refinement framework for meeting summarization: Mistake Identification locates errors in an existing summary, and Summary Refinement then uses them to produce a higher-quality output. It introduces QMSum Mistake, a dataset of 200 summaries annotated by humans for nine error types, to study post-hoc correction. Extensive experiments compare prompting strategies (direct vs. chain-of-thought, CoT) and feedback transfer (direct vs. consolidation). Multi-instance, CoT-guided error identification coupled with targeted feedback significantly improves relevance, informativeness, conciseness, and coherence, approaching gold-standard summaries. Using CoT explanations as feedback with direct transfer generally yields the strongest quality gains, while consolidation is less effective for long transcripts. The work highlights practical implications for robust, scalable post-hoc refinement of meeting summaries, offers a foundation for applying multi-LLM feedback loops to other complex text-generation tasks, and releases the QMSum Mistake dataset.
Abstract
Meeting summarization has become a critical task since digital encounters have become a common practice. Large language models (LLMs) show great potential in summarization, offering enhanced coherence and context understanding compared to traditional methods. However, they still struggle to maintain relevance and avoid hallucination. We introduce a multi-LLM correction approach for meeting summarization using a two-phase process that mimics the human review process: mistake identification and summary refinement. We release QMSum Mistake, a dataset of 200 automatically generated meeting summaries annotated by humans on nine error types, including structural, omission, and irrelevance errors. Our experiments show that these errors can be identified with high accuracy by an LLM. We transform identified mistakes into actionable feedback to improve the quality of a given summary measured by relevance, informativeness, conciseness, and coherence. This post-hoc refinement effectively improves summary quality by leveraging multiple LLMs to validate output quality. Our multi-LLM approach for meeting summarization shows potential for similar complex text generation tasks requiring robustness, action planning, and discussion towards a goal.
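The two-phase process described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `llm` callable (prompt in, text out), the prompt wording, the YES/NO answer convention, and the function names `identify_mistakes` and `refine_summary` are all assumptions made for illustration. Only the three error types named in the abstract are listed; the remaining six are not given here and are left out rather than guessed.

```python
from typing import Callable, Dict

# Hypothetical interface: any function mapping a prompt to a completion.
LLM = Callable[[str], str]

# Only the error types named in the abstract; the dataset annotates nine.
ERROR_TYPES = ["structural", "omission", "irrelevance"]


def identify_mistakes(transcript: str, summary: str, llm: LLM) -> Dict[str, str]:
    """Phase 1: query the LLM once per error type and keep its explanations."""
    feedback: Dict[str, str] = {}
    for error_type in ERROR_TYPES:
        prompt = (
            f"Transcript:\n{transcript}\n\nSummary:\n{summary}\n\n"
            f"Think step by step. Does the summary contain a(n) {error_type} error? "
            "Start your answer with YES or NO, then explain."
        )
        answer = llm(prompt).strip()
        # Assumed convention: an answer starting with YES flags a mistake.
        if answer.upper().startswith("YES"):
            feedback[error_type] = answer
    return feedback


def refine_summary(
    transcript: str, summary: str, feedback: Dict[str, str], llm: LLM
) -> str:
    """Phase 2: pass the explanations to the LLM as feedback for a rewrite."""
    if not feedback:
        return summary  # nothing was flagged, so keep the summary unchanged
    notes = "\n".join(f"- {etype}: {text}" for etype, text in feedback.items())
    prompt = (
        f"Transcript:\n{transcript}\n\nSummary:\n{summary}\n\n"
        f"Reviewer feedback:\n{notes}\n\n"
        "Rewrite the summary so that it addresses every point of feedback."
    )
    return llm(prompt).strip()
```

Passing each explanation to the rewriting step unchanged is one simple way to hand feedback over; merging the explanations into a single note first would be an alternative design. Any real model client can be wrapped to match the assumed `llm` signature.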
