PSentScore: Evaluating Sentiment Polarity in Dialogue Summarization

Yongxin Zhou, Fabien Ringeval, François Portet

TL;DR

This paper tackles the problem that dialogue summaries often omit affective content, despite its value for user experience and healthcare applications. It introduces PSent, a word-level sentiment proportion, and PSentScore, a reference-less metric that uses correlation and error measures to assess how well the sentiment of a dialogue is preserved in its summary. By training word-level sentiment analyzers and applying a sentiment-driven data filtering strategy, the authors show that the preservation of affective content can be improved, at the cost of a modest drop in traditional content-oriented metrics. The work provides a practical framework for sentiment-aware dialogue summarization and suggests that sentiment-aligned training data can enhance affective preservation in generated summaries.

Abstract

Automatic dialogue summarization is a well-established task with the goal of distilling the most crucial information from human conversations into concise textual summaries. However, most existing research has predominantly focused on summarizing factual information, neglecting the affective content, which can hold valuable insights for analyzing, monitoring, or facilitating human interactions. In this paper, we introduce and assess a set of measures, PSentScore, aimed at quantifying the preservation of affective content in dialogue summaries. Our findings indicate that state-of-the-art summarization models do not preserve the affective content well within their summaries. Moreover, we demonstrate that a careful selection of the training set for dialogue samples can lead to improved preservation of affective content in the generated summaries, albeit with a minor reduction in content-related metrics.
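
To make the idea concrete, here is a minimal sketch of how a word-level sentiment proportion and a correlation/error comparison between dialogues and summaries could be computed. It is an illustration under stated assumptions, not the authors' implementation: the toy lexicon stands in for a trained word-level sentiment classifier, whitespace tokenization is a simplification, and the choice of Pearson correlation and mean absolute error is assumed because the text above only says "correlation and error measures". All function names (`toy_word_sentiment`, `psent`, `pearson`, `psent_score`) are hypothetical.

```python
from math import sqrt
from typing import Callable, Dict, Sequence

# ASSUMPTION: a tiny hand-written lexicon replaces the trained
# word-level sentiment analyzer described in the paper.
TOY_LEXICON = {
    "great": "positive", "happy": "positive", "love": "positive",
    "sad": "negative", "terrible": "negative", "angry": "negative",
}


def toy_word_sentiment(word: str) -> str:
    """Label one word as 'positive', 'negative', or 'neutral'."""
    return TOY_LEXICON.get(word.lower().strip(".,!?;:"), "neutral")


def psent(text: str, polarity: str,
          classify: Callable[[str], str] = toy_word_sentiment) -> float:
    """Proportion of words in `text` labeled with `polarity`."""
    words = text.split()  # ASSUMPTION: whitespace tokenization
    if not words:
        return 0.0
    hits = sum(1 for w in words if classify(w) == polarity)
    return hits / len(words)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def psent_score(dialogues: Sequence[str], summaries: Sequence[str],
                polarity: str) -> Dict[str, float]:
    """Compare dialogue and summary sentiment proportions over a corpus.

    ASSUMPTION: Pearson correlation and mean absolute error are used
    here as example correlation and error measures.
    """
    if not dialogues or len(dialogues) != len(summaries):
        raise ValueError("need equally sized, non-empty corpora")
    d = [psent(t, polarity) for t in dialogues]
    s = [psent(t, polarity) for t in summaries]
    mae = sum(abs(a - b) for a, b in zip(d, s)) / len(d)
    return {"pearson": pearson(d, s), "mae": mae}


if __name__ == "__main__":
    dialogues = ["I love this great day", "what a sad sad story",
                 "nothing to report here"]
    summaries = ["a great day", "a story", "nothing"]
    print(psent_score(dialogues, summaries, "positive"))
```

Because the comparison is made between each dialogue and its own summary, no human-written reference summary is needed, which is what makes a metric of this kind reference-less.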

Paper Structure (27 sections, 4 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Box plots for PSentDial (left) vs. PSentSumm (right) distribution using BERT-DS-SST3 on the full DialogSum training and validation sets. Filtered means that samples with PSentDial or PSentSumm values equal to zero have been removed.
  • Figure 2: Example test_20, with its three references, the predictions of two models (baseline-BART$_{Large}$ and baseline_Filtered), and a visualization of each model's attention.
  • Figure 3: Example test_151, with its three references, the predictions of two models (baseline-BART$_{Large}$ and baseline_Filtered), and a visualization of each model's attention.
  • Figure 4: Example test_440, with its three references, the predictions of two models (baseline-BART$_{Large}$ and baseline_Filtered), and a visualization of each model's attention.