TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization

Liyan Tang, Igor Shalyminov, Amy Wing-mei Wong, Jon Burnsky, Jake W. Vincent, Yu'an Yang, Siffi Singh, Song Feng, Hwanjun Song, Hang Su, Lijia Sun, Yi Zhang, Saab Mansour, Kathleen McKeown

TL;DR

TofuEval introduces a topic-focused dialogue summarization benchmark with expert-annotated factuality, relevance, and completeness labels for LLM-generated summaries from MediaSum and MeetingBank. The study systematically compares LLMs as both summarizers and evaluators, finding substantial hallucinations in dialogue summaries and that non-LLM factuality metrics generally outperform LLM-based evaluators. A detailed error taxonomy reveals diverse hallucination types, and results highlight remaining challenges in detecting factual errors, especially for main-topic content. The authors release the dataset to spur progress in robust automated evaluation of dialogue summaries and to inform the development of more faithful summarization systems.

Abstract

Single document news summarization has seen substantial progress on faithfulness in recent years, driven by research on the evaluation of factual consistency, or hallucinations. We ask whether these advances carry over to other text summarization domains. We propose a new evaluation benchmark on topic-focused dialogue summarization, generated by LLMs of varying sizes. We provide binary sentence-level human annotations of the factual consistency of these summaries along with detailed explanations of factually inconsistent sentences. Our analysis shows that existing LLMs hallucinate significant amounts of factual errors in the dialogue domain, regardless of the model's size. On the other hand, when LLMs, including GPT-4, serve as binary factual evaluators, they perform poorly and can be outperformed by prevailing state-of-the-art specialized factuality evaluation metrics. Finally, we conducted an analysis of hallucination types with a curated error taxonomy. We find that there are diverse errors and error distributions in model-generated summaries and that non-LLM based metrics can capture all error types better than LLM-based evaluators.

Paper Structure

This paper contains 85 sections, 3 equations, 9 figures, and 13 tables.

Figures (9)

  • Figure 1: TofuEval contains 1.5K topic-focused summaries from two dialogue summarization datasets. We ask expert linguistic annotators to evaluate completeness, relevance and factual consistency of each summary, along with explanations and error types for factually inconsistent sentences.
  • Figure 2: Error distribution over factually inconsistent summary sentences for TofuEval (left) and for each summarizer over main/marginal topics (right). See the appendix for error distributions over all summary sentences for each summarizer over main/marginal topics.
  • Figure 3: Error taxonomy and definitions. We include examples of factually inconsistent summary sentences and corresponding human annotated explanations from TofuEval. Error spans are highlighted (not included in TofuEval).
  • Figure 4: Recall of summary factual inconsistency predictions by error type. Non-LLM based factuality metrics are better at capturing errors than LLM-based evaluators across all error types.
  • Figure 5: Error distributions over all summary sentences for each summarizer for main/marginal topics.
  • ...and 4 more figures
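
The per-error-type recall described in the Figure 4 caption can be illustrated with a minimal sketch. It assumes each human-labeled inconsistent sentence carries one error type and one binary evaluator prediction. The function name, the input layout, and the placeholder labels `type_a` and `type_b` are hypothetical and are not taken from the paper or the TofuEval release.

```python
from collections import defaultdict


def recall_by_error_type(examples):
    """Compute recall per error type.

    `examples` is an iterable of (error_type, predicted_inconsistent) pairs,
    one per sentence that human annotators marked as factually inconsistent.
    This input layout is an assumption made for illustration.
    """
    caught = defaultdict(int)  # inconsistent sentences the evaluator flagged
    total = defaultdict(int)   # all inconsistent sentences of that type
    for error_type, predicted_inconsistent in examples:
        total[error_type] += 1
        if predicted_inconsistent:
            caught[error_type] += 1
    return {t: caught[t] / total[t] for t in total}


# Hypothetical toy data with placeholder error-type names.
toy = [
    ("type_a", True),
    ("type_a", False),
    ("type_b", True),
    ("type_b", True),
]
print(recall_by_error_type(toy))
```

Running the same function once per evaluator on the same annotated sentences would give the kind of side-by-side comparison the caption describes, although the paper's exact procedure may differ.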