QUARTZ : QA-based Unsupervised Abstractive Refinement for Task-oriented Dialogue Summarization

Mohamed Imed Eddine Ghebriout, Gaël Guibon, Ivan Lerner, Emmanuel Vincent

TL;DR

QUARTZ addresses unsupervised, task-oriented dialogue summarization by leveraging a pool of LLMs to generate diverse summaries and task-focused QA pairs in a zero-shot setting. It employs a two-stage QA-based evaluation using Kendall tau aggregation and Mean Reciprocal Rank to identify the most informative summaries, followed by unsupervised fine-tuning of the top summarizer with a task-conditioned likelihood objective, formalized as maximizing P(S^* | D, T, θ). Across SAMSum, DialogSum, MTS-Dialog, and SimSAMU, QUARTZ achieves competitive or superior performance to fully supervised state-of-the-art methods, with clear gains in task relevance, factual soundness, and robustness in low-resource scenarios. The approach reduces annotation costs and has practical impact for domains like healthcare and business meetings, enabling coherent, task-focused summaries while maintaining transparency through open-source LLM usage and explicit evaluation protocols. The work also highlights the value of a diverse LLM pool and outlines future directions, including iterative QUARTZ and alternative fine-tuning strategies to further enhance alignment and quality.
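
The fine-tuning objective named above can be written out as follows. This is only a sketch under assumptions not stated in this summary: the sum over a set $\mathcal{C}$ of (dialogue, task, selected summary) triples and the token-level factorization are the usual form of likelihood training for autoregressive LLMs, and $s^*_t$ is a hypothetical notation for the $t$-th token of $S^*$.

```latex
\theta^{\star}
  = \arg\max_{\theta} \sum_{(D,\,T,\,S^*) \in \mathcal{C}} \log P(S^* \mid D, T, \theta),
\qquad
\log P(S^* \mid D, T, \theta)
  = \sum_{t=1}^{|S^*|} \log P\bigl(s^*_t \mid s^*_{<t}, D, T, \theta\bigr)
```

Under this reading, $S^*$ is the summary selected by the QA-based evaluation, so no human-written reference enters the objective.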

Abstract

Dialogue summarization aims to distill the core meaning of a conversation into a concise text. This is crucial for reducing the complexity and noise inherent in dialogue-heavy applications. While recent approaches typically train language models to mimic human-written summaries, such supervision is costly and often results in outputs that lack task-specific focus, limiting their effectiveness in downstream applications such as medical tasks. In this paper, we propose QUARTZ, a framework for task-oriented, utility-based dialogue summarization. QUARTZ starts by generating multiple summaries and task-oriented question-answer pairs from a dialogue in a zero-shot manner using a pool of large language models (LLMs). The quality of the generated summaries is evaluated by having LLMs answer task-related questions before (i) selecting the best candidate answers and (ii) identifying the most informative summary based on these answers. Finally, we fine-tune the best LLM on the selected summaries. When validated on multiple datasets, QUARTZ demonstrates its effectiveness by achieving competitive results in various zero-shot settings, rivaling fully supervised state-of-the-art (SotA) methods.
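
The selection step described above can be illustrated with a small sketch. The exact procedure is not given in this summary, so the following is an assumption rather than the authors' implementation: for each task question, several judges each rank the candidate summaries, the rankings are merged into a consensus that minimizes the total Kendall tau distance (brute force, which is only practical for a small pool), and the summary with the highest Mean Reciprocal Rank across questions is kept. All function names and the toy data are hypothetical.

```python
from itertools import combinations, permutations


def kendall_tau_distance(a, b):
    """Count the item pairs that rankings a and b order differently."""
    pos_a = {item: i for i, item in enumerate(a)}
    pos_b = {item: i for i, item in enumerate(b)}
    return sum(
        1
        for x, y in combinations(a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def aggregate_rankings(rankings):
    """Consensus ranking minimizing the summed Kendall tau distance.

    Brute force over all permutations: an assumption that suits a pool
    of a few candidates, not a scalable aggregation method.
    """
    items = sorted(rankings[0])
    return min(
        permutations(items),
        key=lambda cand: sum(kendall_tau_distance(cand, r) for r in rankings),
    )


def mean_reciprocal_rank(consensus_per_question, summary_id):
    """Average of 1 / rank of summary_id over the per-question consensus."""
    total = sum(1.0 / (c.index(summary_id) + 1) for c in consensus_per_question)
    return total / len(consensus_per_question)


def select_best_summary(rankings_per_question):
    """Return the candidate summary with the highest MRR across questions."""
    consensus = [aggregate_rankings(r) for r in rankings_per_question]
    return max(consensus[0], key=lambda s: mean_reciprocal_rank(consensus, s))


# Toy data: three questions, three judges each, candidate summaries A, B, C.
toy_rankings = [
    [["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]],
    [["B", "A", "C"], ["B", "C", "A"], ["B", "A", "C"]],
    [["A", "B", "C"], ["A", "B", "C"], ["C", "A", "B"]],
]
print(select_best_summary(toy_rankings))
```

In a full pipeline the selected summaries would then serve as fine-tuning targets for the best-performing LLM, as the abstract states; how ties and disagreeing judges are handled in QUARTZ itself is not specified here.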

Paper Structure

This paper contains 43 sections, 9 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: An overview of QUARTZ for unsupervised task-oriented dialogue summarization.
  • Figure 2: Impact of LLM pool size on summarization performance. Dotted lines represent the mean performance. Shaded regions indicate the standard deviation across the combinations (see the appendix on LLM pool configuration).
  • Figure 3: Win Rate of LLMs Across Datasets
  • Figure 4: An example dialogue from the DialogSum dataset. Left: The original dialogue and the reference summary. Right: The generated summaries from each LLM in the pool along with the task-related QAs. Additional information introduced by the best-selected summary is highlighted in bold.
  • Figure 5: Another example dialogue from the DialogSum dataset. Left: The original dialogue and the reference summary. Right: The generated summaries from each LLM in the pool along with the task-related QAs. Additional information introduced by the best-selected summary is highlighted in bold.
  • ...and 1 more figure