AILS-NTUA at SemEval-2024 Task 6: Efficient model tuning for hallucination detection and analysis

Natalia Grigoriadou, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou

TL;DR

This work tackles the detection of fluent overgeneration hallucinations in SHROOM, the SemEval-2024 Task 6 benchmark spanning definition modeling (DM), machine translation (MT), and paraphrase generation (PG). It proposes a lightweight, black-box pipeline that fine-tunes a hallucination-detection model and an NLI model on SHROOM-adjacent data, then ensembles them via a Voting Classifier to improve accuracy. The ensemble reaches roughly $0.80$ accuracy (79.9%) on the model-aware track and about $0.78$ (77.8%) on the model-agnostic track, outperforming the organizers' baseline and approaching the top competition results (81.3% and 84.7%, respectively), while remaining efficient and avoiding model probing. Per-task results, error inspection, and runtime data support the approach's practicality and interpretability for hallucination detection in real-world NLG systems.

Abstract

In this paper, we present our team's submissions for SemEval-2024 Task 6 (SHROOM), a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. Participants were asked to perform binary classification to identify cases of fluent overgeneration hallucinations. Our experiments included fine-tuning a pre-trained hallucination-detection model and a Natural Language Inference (NLI) model. The most successful strategy was an ensemble of these models, which reached accuracies of 77.8% and 79.9% on the model-agnostic and model-aware datasets respectively, outperforming the organizers' baseline and comparing favorably with the top-performing systems in the competition, which reported accuracies of 84.7% and 81.3% respectively.
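The ensembling idea in the abstract can be illustrated with a minimal sketch. It assumes each fine-tuned model outputs a probability p('Hallucination') per sample and that the votes are combined by weighted averaging (soft voting) with a 0.5 threshold. The function names, the weights, the threshold, and the label strings are all hypothetical choices for illustration, not the authors' reported configuration.

```python
from typing import List, Optional, Sequence

# Label strings chosen for illustration only (assumption).
HALLUCINATION = "Hallucination"
NOT_HALLUCINATION = "Not Hallucination"


def soft_vote(
    probs_per_model: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
    threshold: float = 0.5,
) -> List[str]:
    """Combine per-model p('Hallucination') scores by weighted averaging.

    probs_per_model[m][i] is model m's probability that sample i is a
    hallucination. Equal weights are used when none are given.
    """
    n_models = len(probs_per_model)
    if n_models == 0:
        raise ValueError("at least one model is required")
    if weights is None:
        weights = [1.0] * n_models
    if len(weights) != n_models:
        raise ValueError("one weight per model is required")
    n_samples = len(probs_per_model[0])
    if any(len(p) != n_samples for p in probs_per_model):
        raise ValueError("all models must score the same samples")

    total_weight = sum(weights)
    labels = []
    for i in range(n_samples):
        # Weighted mean of the models' probabilities for sample i.
        avg = sum(w * p[i] for w, p in zip(weights, probs_per_model)) / total_weight
        labels.append(HALLUCINATION if avg >= threshold else NOT_HALLUCINATION)
    return labels


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of samples whose predicted label matches the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Made-up scores for three samples from two hypothetical models.
detector_probs = [0.9, 0.2, 0.55]
nli_probs = [0.6, 0.4, 0.30]
predictions = soft_vote([detector_probs, nli_probs])
```

Averaging probabilities lets a confident model outweigh an uncertain one on a given sample. A hard-voting variant would instead threshold each model first and take the majority label.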

Paper Structure (27 sections, 12 figures, 8 tables)

This paper contains 27 sections, 12 figures, 8 tables.

Figures (12)

  • Figure 1: p('Hallucination') for all misclassified samples of the model-aware dataset.
  • Figure 2: p('Hallucination') for all misclassified samples of the model-agnostic dataset.
  • Figure 3: Distribution of per task samples in the initially released trial set.
  • Figure 4: Distribution of unlabelled training samples per task in both model-agnostic and model-aware settings.
  • Figure 5: Distribution of labeled validation samples per task in both model-agnostic and model-aware settings.
  • ...and 7 more figures