AILS-NTUA at SemEval-2024 Task 6: Efficient model tuning for hallucination detection and analysis

Natalia Grigoriadou; Maria Lymperaiou; Giorgos Filandrianos; Giorgos Stamou

TL;DR

This work addresses the detection of fluent overgeneration hallucinations in SHROOM, the SemEval-2024 Task 6 benchmark spanning definition modeling (DM), machine translation (MT), and paraphrase generation (PG). It proposes a lightweight, black-box pipeline that fine-tunes a hallucination-detection model and an NLI model on SHROOM-adjacent data, then combines them with a Voting Classifier to improve accuracy. The ensemble reaches 79.9% accuracy on the model-aware track and 77.8% on the model-agnostic track, outperforming the organizers' baseline and approaching the top competition results (81.3% and 84.7%, respectively), while remaining efficient and requiring no model probing. Per-task results, error inspection, and runtime measurements support the approach's practicality and interpretability for hallucination detection in real-world NLG systems.

Abstract

In this paper, we present our team's submissions for SemEval-2024 Task 6 (SHROOM), a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. Participants were asked to perform binary classification to identify cases of fluent overgeneration hallucinations. Our experiments included fine-tuning a pre-trained hallucination-detection model and a Natural Language Inference (NLI) model. The most successful strategy was an ensemble of these models, which reached accuracies of 77.8% and 79.9% on the model-agnostic and model-aware datasets respectively. This outperforms the organizers' baseline and compares favorably with the top-performing submissions in the competition, which reported accuracies of 84.7% and 81.3% respectively.
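The ensembling idea in the abstract can be illustrated with a minimal sketch. It assumes soft voting, that is, averaging each model's p('Hallucination') and thresholding at 0.5; the abstract does not state whether the Voting Classifier votes on labels or on probabilities, so this is only one plausible reading. The function names, the label strings, and all scores below are hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Sequence


def soft_vote(
    model_probs: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
    threshold: float = 0.5,
) -> List[str]:
    """Average per-model p('Hallucination') and threshold the result.

    Assumption: soft voting with optional per-model weights.
    """
    if not model_probs:
        raise ValueError("need at least one model")
    n_samples = len(model_probs[0])
    if any(len(p) != n_samples for p in model_probs):
        raise ValueError("all models must score the same samples")
    if weights is None:
        weights = [1.0] * len(model_probs)
    total = sum(weights)
    labels = []
    for i in range(n_samples):
        # Weighted mean of the models' hallucination probabilities.
        avg = sum(w * p[i] for w, p in zip(weights, model_probs)) / total
        labels.append("Hallucination" if avg >= threshold else "Not Hallucination")
    return labels


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of samples whose predicted label matches the gold label."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Hypothetical scores from the two fine-tuned models on four samples.
detector_probs = [0.9, 0.2, 0.6, 0.4]  # hallucination-detection model
nli_probs = [0.7, 0.1, 0.3, 0.8]       # NLI model mapped to p('Hallucination')
gold = ["Hallucination", "Not Hallucination", "Not Hallucination", "Hallucination"]

predictions = soft_vote([detector_probs, nli_probs])
print(predictions, accuracy(predictions, gold))
```

In this toy input each model alone misclassifies one sample (the third for the detector, the fourth for the NLI model), while the averaged score gets both right, which is the kind of complementarity an ensemble aims to exploit.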

Paper Structure (27 sections, 12 figures, 8 tables)

Introduction
Related Work
NLP hallucinations
Task and Dataset description
Data details
Evaluation metrics
Methods
Fine-tune hallucination detection model
Fine-tune NLI models
Voting Classifier
Experiments
Experimental setup
Fine-tune hallucination model
Natural Language Inference (NLI) models
Voting Classifier
...and 12 more sections

Figures (12)

Figure 1: p('Hallucination') for all misclassified samples of the model-aware dataset.
Figure 2: p('Hallucination') for all misclassified samples of the model-agnostic dataset.
Figure 3: Distribution of per task samples in the initially released trial set.
Figure 4: Distribution of unlabelled training samples per task in both model-agnostic and model-aware settings.
Figure 5: Distribution of labeled validation samples per task in both model-agnostic and model-aware settings.
...and 7 more figures
