German also Hallucinates! Inconsistency Detection in News Summaries with the Absinth Dataset

Laura Mascarell; Ribin Chalumattu; Annette Rios

TL;DR

This paper presents Absinth, a manually annotated dataset for hallucination detection in German news summarization, and explores the capabilities of novel open-source LLMs on this task in both fine-tuning and in-context learning settings.

Abstract

The advent of Large Language Models (LLMs) has led to remarkable progress on a wide range of natural language processing tasks. Despite these advances, such large models still hallucinate information in their output. This poses a major issue in automatic text summarization, where the generated summary must be consistent with the content of the source document. Previous research addresses the challenging task of detecting hallucinations in the output (i.e., inconsistency detection) in order to evaluate the faithfulness of generated summaries. However, these works primarily focus on English, and recent multilingual approaches lack German data. This work presents absinth, a manually annotated dataset for hallucination detection in German news summarization, and explores the capabilities of novel open-source LLMs on this task in both fine-tuning and in-context learning settings. We release the absinth dataset as open source to foster further research on hallucination detection in German.

Paper Structure (23 sections, 2 figures, 7 tables)

Introduction
The absinth Dataset
Dataset Construction
Annotation Task
Gold Standard
Intuitive Annotation Framework
Continuous Evaluation
Final Dataset
Inconsistency Detection Task
Models Selection
Results
Related Work
Conclusion
Ethics Statement
Limitations
...and 8 more sections

Figures (2)

Figure 1: Class distribution for each summarization model in absinth. The largest models, GPT-4 and Stable Beluga 2, generate the fewest hallucinations. Since summaries differ in their number of sentences, the total number of instances varies among models.
Figure 2: User interface of the annotation framework. We provide the article and all summary sentences. The interface highlights the summary sentence that is currently being reviewed.
