DIAL-SUMMER: A Structured Evaluation Framework of Hierarchical Errors in Dialogue Summaries

Sahana Ramnath; Nima Chitsazan; Mingyang Zhou; Chia-Hsuan Lee; Shi-Xiong Zhang; Stephen Rawls; Sambit Sahu; Sangwoo Cho; Xiang Ren; Genta Indra Winata; Akshaj Kumar Veldanda

TL;DR

The paper introduces Dial-SummEr, a hierarchical error taxonomy and an annotated inference dataset to address the unique challenges of evaluating dialogue summaries. It distinguishes dialogue-level structural errors from within-turn content errors and accounts for the shift in narration viewpoint from the speakers' first/second-person narration to the summary's third-person narration. Empirical analyses reveal prevalent error patterns, including the prominence of missed mid-dialogue turns and end-of-summary extrinsic hallucinations. The work also assesses LLM-Judges as error detectors, showing modest performance improvements when prompted with the Dial-SummEr taxonomy and highlighting the need for further data and model development to robustly detect and correct dialogue-summary errors.

Abstract

Dialogues are a predominant mode of communication for humans, and it is immensely helpful to have automatically generated summaries of them (e.g., to revise key points discussed in a meeting, to review conversations between customer agents and product users). Prior works on dialogue summary evaluation largely ignore the complexities specific to this task: (i) shift in structure, from multiple speakers discussing information in a scattered fashion across several turns, to a summary's sentences, and (ii) shift in narration viewpoint, from speakers' first/second-person narration to standardized third-person narration in the summary. In this work, we introduce our framework DIAL-SUMMER to address the above. We propose DIAL-SUMMER's taxonomy of errors to comprehensively evaluate dialogue summaries at two hierarchical levels: DIALOGUE-LEVEL that focuses on the broader speakers/turns, and WITHIN-TURN-LEVEL that focuses on the information talked about inside a turn. We then present DIAL-SUMMER's dataset composed of dialogue summaries manually annotated with our taxonomy's fine-grained errors. We conduct empirical analyses of these annotated errors, and observe interesting trends (e.g., turns occurring in the middle of the dialogue are the most frequently missed in the summary, extrinsic hallucinations largely occur at the end of the summary). We also conduct experiments on LLM-Judges' capability at detecting these errors, through which we demonstrate the challenging nature of our dataset, the robustness of our taxonomy, and the need for future work in this field to enhance LLMs' performance in the same. Code and inference dataset coming soon.

Paper Structure (22 sections, 1 equation, 3 figures, 8 tables)

Introduction
Related Work
Dial-SummEr's taxonomy
Descriptions of Errors
Addressing the complexities in dialogue summary evaluation
Dial-SummEr Dataset
Analysis of Dataset
Sentence-level analysis of errors.
Detecting errors using LLM-as-a-Judge
Experimental Setup
Results & Discussion
Coarse Hallucination error.
Fine-grained errors.
Conclusion
Trust and Risk.
...and 7 more sections

Figures (3)

Figure 1: Our framework Dial-SummEr addresses errors that arise due to the shift in structure and narration viewpoint from a dialogue to its summary. Our framework is composed of (i) an error taxonomy with both dialogue-level and within-turn-level errors, and (ii) a human-annotated dialogue-summary inference dataset labeled with these errors.
Figure 2: Error frequency: percentage of the 192 summaries in Dial-SummEr's dataset that exhibit each error.
Figure 3: We take the human-annotated errors in Dial-SummEr's dataset and plot the distribution of the positions of the error in the summary (or the dialogue, for Missed Turn). We see that extrinsic hallucination tends to occur at the end of the summary, Viewpoint Distortion at the start, and intrinsic hallucinations and missed turns largely occur in the middle.
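
The two statistics described in the captions of Figures 2 and 3 can be illustrated with a minimal Python sketch. It is not the authors' code: the annotation record layout (`summary_id`, `error`, `index`, `length`), the function names, and the split of positions into equal thirds labeled start, middle, and end are all assumptions made for illustration. For Missed Turn, `index` and `length` would refer to turns in the dialogue rather than sentences in the summary.

```python
from collections import Counter, defaultdict


def error_frequency(annotations, num_summaries):
    """Percentage of summaries that exhibit each error type at least once.

    `annotations` is a list of dicts with hypothetical keys
    'summary_id' and 'error' (an assumed layout, not the paper's format).
    """
    summaries_with_error = defaultdict(set)
    for ann in annotations:
        summaries_with_error[ann["error"]].add(ann["summary_id"])
    return {
        err: 100.0 * len(ids) / num_summaries
        for err, ids in summaries_with_error.items()
    }


def position_bucket(index, length):
    """Map a 0-based position to 'start', 'middle', or 'end'.

    Splitting into equal thirds is an assumption for this sketch.
    """
    if length <= 0 or not 0 <= index < length:
        raise ValueError("index must satisfy 0 <= index < length")
    relative = (index + 0.5) / length  # midpoint of the unit, in (0, 1)
    if relative < 1 / 3:
        return "start"
    if relative < 2 / 3:
        return "middle"
    return "end"


def position_distribution(annotations):
    """Count, per error type, how often it falls in each position bucket."""
    dist = defaultdict(Counter)
    for ann in annotations:
        bucket = position_bucket(ann["index"], ann["length"])
        dist[ann["error"]][bucket] += 1
    return {err: dict(counts) for err, counts in dist.items()}
```

Counting each summary once per error type in `error_frequency` matches the "% of summaries" reading of Figure 2, whereas `position_distribution` counts every annotated error instance, which is one plausible reading of Figure 3.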
