
SmurfCat at SemEval-2024 Task 6: Leveraging Synthetic Data for Hallucination Detection

Elisei Rykov, Yana Shishkina, Kseniia Petrushina, Kseniia Titova, Sergey Petrakov, Alexander Panchenko

TL;DR

The paper tackles the SemEval-2024 Task 6 hallucination detection task by using synthetic data to augment training. It compares baselines, fine-tuned embeddings (E5-Mistral with LoRA), and a refined Mutual Implication Score (MIS), alongside content-preservation assessments, and combines them via ensemble methods. Ensembling, especially voting, yields the strongest performance and comes close to the top of the model-agnostic standings. MIS trained on PAWS and PG+DM data shows notable gains, while GPT-4 prompted data can introduce biases. The work demonstrates that synthetic in-domain augmentation, when paired with careful data curation and ensembling, can narrow the gap to the leading approaches, and the data and code are publicly available for adoption and extension.

Abstract

In this paper, we present our novel systems developed for the SemEval-2024 hallucination detection task. Our investigation spans a range of strategies to compare model predictions with reference standards, encompassing diverse baselines, the refinement of pre-trained encoders through supervised learning, and ensemble approaches that combine several high-performing models. Through these explorations, we introduce three distinct methods that exhibit strong performance. To enlarge our training data, we generate additional training samples from the unlabelled training subset. Furthermore, we provide a detailed comparative analysis of our approaches. Notably, our best method placed 9th in the competition's model-agnostic track and 17th in the model-aware track, highlighting its effectiveness and potential.

Paper Structure

This paper contains 25 sections, 1 equation, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Classifier architecture when using synthetic data.
  • Figure 2: Prompt for GPT-4 evaluation on PG task.
  • Figure 3: Prompt for PG data with hallucinations generation using GPT-4.