Analyzing LLM Behavior in Dialogue Summarization: Unveiling Circumstantial Hallucination Trends

Sanjana Ramprasad; Elisa Ferracane; Zachary C. Lipton

TL;DR

The paper probes the faithfulness of LLMs in dialogue summarization, introducing Circumstantial Inference as a new error category and providing a refined error taxonomy. It benchmarks zero-shot GPT-4 and Alpaca-13B against fine-tuned models on SAMSum and DialogSum, annotating inconsistencies and revealing that LLMs often produce plausible but unsupported inferences. To improve detection, the authors propose two prompt-based, span-aware error detectors that outperform traditional factuality metrics, especially for Circumstantial Inference. The study highlights limitations of current evaluation metrics for evolving LLM capabilities and emphasizes the need for benchmarks that reflect newer model distributions, while releasing the dataset to foster further research. Overall, the work advances understanding of dialogue summarization fidelity and offers practical approaches to detect nuanced errors in LLM-generated summaries.

Abstract

Recent progress in large language models (LLMs) has considerably advanced the capabilities of summarization systems. However, these models continue to face concerns about hallucinations. While prior work has evaluated LLMs extensively in news domains, most evaluation of dialogue summarization has focused on BART-based models, leaving a gap in our understanding of the faithfulness of LLMs in this setting. Our work benchmarks the faithfulness of LLMs for dialogue summarization, using human annotations and focusing on identifying and categorizing span-level inconsistencies. Specifically, we focus on two prominent LLMs: GPT-4 and Alpaca-13B. Our evaluation reveals subtleties as to what constitutes a hallucination: LLMs often generate plausible inferences that are supported by circumstantial evidence in the conversation but lack direct evidence, a pattern that is less prevalent in older models. We propose a refined taxonomy of errors, coining the category of "Circumstantial Inference" to capture these LLM behaviors, and release the dataset. Using our taxonomy, we compare the behavioral differences between LLMs and older fine-tuned models. Additionally, we systematically assess the efficacy of automatic error detection methods on LLM summaries and find that they struggle to detect these nuanced errors. To address this, we introduce two prompt-based approaches for fine-grained error detection that outperform existing metrics, particularly for identifying "Circumstantial Inference."
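To make the idea of span-level inconsistency annotation more concrete, the following is a minimal sketch of how such annotations might be stored and aggregated. It is not the authors' released data format: the class names, field names, model identifiers, and toy summaries are all assumptions made for illustration. Only the label "Circumstantial Inference" comes from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SpanAnnotation:
    """One flagged span in a summary (hypothetical schema)."""
    start: int     # character offset where the flagged span begins
    end: int       # character offset just past the flagged span
    category: str  # error label, e.g. "Circumstantial Inference"


@dataclass
class AnnotatedSummary:
    """A model summary together with its flagged spans (hypothetical schema)."""
    model: str
    summary: str
    spans: list[SpanAnnotation] = field(default_factory=list)


def inconsistency_rate(summaries):
    """Per model, the fraction of summaries with at least one flagged span."""
    total = defaultdict(int)
    flagged = defaultdict(int)
    for s in summaries:
        total[s.model] += 1
        if s.spans:
            flagged[s.model] += 1
    return {model: flagged[model] / total[model] for model in total}


def category_counts(summaries):
    """Number of flagged spans per error category, across all summaries."""
    counts = defaultdict(int)
    for s in summaries:
        for span in s.spans:
            counts[span.category] += 1
    return dict(counts)


# Toy data, invented for illustration only (not taken from the dataset).
toy = [
    AnnotatedSummary(
        "model_a",
        "They are discussing their son.",
        [SpanAnnotation(20, 29, "Circumstantial Inference")],
    ),
    AnnotatedSummary("model_a", "They plan to meet at noon."),
    AnnotatedSummary("model_b", "They talk about dinner."),
]

rates = inconsistency_rate(toy)
counts = category_counts(toy)
```

Storing character offsets rather than the span text keeps each annotation tied to an exact location, so a summary-level rate and a per-category breakdown can both be derived from the same records.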

Paper Structure (35 sections, 6 figures, 3 tables)

Introduction
Human Evaluation: Zero-shot Prompted Dialogue Summaries
Datasets
Models
Fine-grained inconsistency annotation
Evaluation Results
Automatic Error Detection
Metrics
Binary Classification
Span Detection
Evaluation Setup
Results
Conclusion
Limitations and Ethics
Citations
...and 20 more sections

Figures (6)

Figure 1: In the example provided, GPT-4 infers that the speakers are discussing "their son." Although this inference seems plausible given the circumstantial evidence in the conversation, it lacks direct evidence.
Figure 2: Each bar shows the proportion of a model's summaries that contain inconsistencies. GPT-4 performs best (lower means fewer inconsistencies).
Figure 4: Automatic error detectors perform differently on FT-Summ and LLM summaries. While QA/NLI metrics indicate a slight improvement, prompt-based metrics are better at detecting inconsistencies generated by the FT-Summ model than those generated by LLMs.
Figure 5: Inconsistency binary classification per error category
Figure 7: Inconsistency rate of all models per dataset. Alpaca-13B is competitive with BART with respect to consistency on the DialogSum dataset but is outperformed on the SAMSum dataset.
...and 1 more figure
