QUARTZ: QA-based Unsupervised Abstractive Refinement for Task-oriented Dialogue Summarization
Mohamed Imed Eddine Ghebriout, Gaël Guibon, Ivan Lerner, Emmanuel Vincent
TL;DR
QUARTZ addresses unsupervised, task-oriented dialogue summarization by using a pool of LLMs to generate diverse summaries and task-focused QA pairs in a zero-shot setting. It employs a two-stage QA-based evaluation, using Kendall tau aggregation and Mean Reciprocal Rank (MRR), to identify the most informative summaries. The top summarizer is then fine-tuned without human labels under a task-conditioned likelihood objective, formalized as maximizing P(S^* | D, T, θ). Across SAMSum, DialogSum, MTS-Dialog, and SimSAMU, QUARTZ performs competitively with, or better than, fully supervised state-of-the-art methods, with clear gains in task relevance, factual soundness, and robustness in low-resource scenarios. The approach reduces annotation costs and has practical impact for domains such as healthcare and business meetings, enabling coherent, task-focused summaries while maintaining transparency through open-source LLM usage and explicit evaluation protocols. The work also highlights the value of a diverse LLM pool and outlines future directions, including an iterative version of QUARTZ and alternative fine-tuning strategies to further improve alignment and quality.
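The fine-tuning objective named above can be written out as follows. This is a sketch under assumptions: the symbols are read as D for the dialogue, T for the task, S^* for the selected summary, and θ for the summarizer's parameters, and the token-level factorization on the right is the usual autoregressive decomposition, which the summary above does not itself state.

```latex
\theta^{*} = \arg\max_{\theta} \sum_{(D,\,T,\,S^{*})} \log P(S^{*} \mid D, T, \theta),
\qquad
\log P(S^{*} \mid D, T, \theta) = \sum_{i=1}^{|S^{*}|} \log P\!\left(s^{*}_{i} \mid s^{*}_{<i}, D, T, \theta\right)
```

Under this reading, the summaries selected by the QA-based evaluation act as pseudo-labels, so no human-written reference summary enters the objective.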
Abstract
Dialogue summarization aims to distill the core meaning of a conversation into a concise text. This is crucial for reducing the complexity and noise inherent in dialogue-heavy applications. While recent approaches typically train language models to mimic human-written summaries, such supervision is costly and often results in outputs that lack task-specific focus, limiting their effectiveness in downstream applications such as medical tasks. In this paper, we propose QUARTZ, a framework for task-oriented, utility-based dialogue summarization. QUARTZ starts by generating multiple summaries and task-oriented question-answer pairs from a dialogue in a zero-shot manner using a pool of large language models (LLMs). The quality of the generated summaries is evaluated by having LLMs answer task-related questions before (i) selecting the best candidate answers and (ii) identifying the most informative summary based on these answers. Finally, we fine-tune the best LLM on the selected summaries. When validated on multiple datasets, QUARTZ demonstrates its effectiveness by achieving competitive results in various zero-shot settings, rivaling fully supervised state-of-the-art (SotA) methods.
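Step (ii), identifying the most informative summary, is described in the TL;DR as relying on Mean Reciprocal Rank. A minimal Python sketch of MRR-based selection is given below. It assumes that, for each task question, the candidate summaries have already been ordered from best to worst; how that ordering is produced is not specified here, and the function names and summary identifiers are hypothetical.

```python
from collections import defaultdict


def mean_reciprocal_rank(rankings):
    """Return the MRR of each summary id.

    rankings: one list per question, ordering summary ids best-first.
    (Assumed input format; not taken from the paper.)
    """
    totals = defaultdict(float)
    for ranking in rankings:
        for position, summary_id in enumerate(ranking, start=1):
            totals[summary_id] += 1.0 / position
    n_questions = len(rankings)
    return {sid: total / n_questions for sid, total in totals.items()}


def select_best_summary(rankings):
    """Pick the summary id with the highest MRR across questions."""
    scores = mean_reciprocal_rank(rankings)
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical example: three questions, three candidate summaries.
rankings = [
    ["s2", "s1", "s3"],
    ["s2", "s3", "s1"],
    ["s1", "s2", "s3"],
]
best, scores = select_best_summary(rankings)
print(best, scores)
```

In this toy input, `s2` is ranked first for two of the three questions and second for the other, so it receives the highest MRR and is selected. Ties are broken by insertion order, which is an arbitrary choice of this sketch.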
