ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models

Yuzhe Gu, Ziwei Ji, Wenwei Zhang, Chengqi Lyu, Dahua Lin, Kai Chen

TL;DR

ANAH-v2 introduces an EM-based iterative self-training framework that scales fine-grained hallucination annotation while improving annotator accuracy. Using three stages of multi-dimensional data scaling and a three-phase analytical annotation process, a 7B-parameter annotator achieves state-of-the-art zero-shot hallucination detection on HaluEval and HalluQA, surpassing GPT-4. The annotator also supports automated hallucination evaluation of LLMs and a simple re-ranking mitigation that raises the Natural Language Inference (NLI) metric from 25% to 37% on HaluEval. With a large, diverse ANAH-v2 dataset and generalization across languages and models, the approach offers a scalable path toward more trustworthy LLM deployment and richer benchmarks for future research.

Abstract

Large language models (LLMs) exhibit hallucinations in long-form question-answering tasks across various domains and wide applications. Current hallucination detection and mitigation datasets are limited in domain and size, and they struggle to scale due to prohibitive labor costs and the insufficient reliability of existing hallucination annotators. To facilitate the scalable oversight of LLM hallucinations, this paper introduces an iterative self-training framework that simultaneously and progressively scales up the hallucination annotation dataset and improves the accuracy of the hallucination annotator. Based on the Expectation Maximization (EM) algorithm, in each iteration the framework first applies a hallucination annotation pipeline to annotate a scaled dataset and then trains a more accurate hallucination annotator on that dataset. This new hallucination annotator is adopted in the hallucination annotation pipeline used for the next iteration. Extensive experimental results demonstrate that the final hallucination annotator, with only 7B parameters, surpasses the performance of GPT-4 and obtains new state-of-the-art hallucination detection results on HaluEval and HalluQA by zero-shot inference. Such an annotator can not only evaluate the hallucination levels of various LLMs on the large-scale dataset but also help to mitigate hallucination in LLM generations, with the Natural Language Inference (NLI) metric increasing from 25% to 37% on HaluEval.
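
The loop described in the abstract (annotate a scaled dataset, retrain, reuse the new annotator) can be sketched as follows. This is a minimal toy sketch, not the authors' implementation: `train_annotator`, `self_train`, and the label strings are hypothetical names, and the "training" step simply memorizes labels and falls back to the majority label, standing in for fine-tuning a 7B model.

```python
from collections import Counter
from typing import Callable, List, Tuple

Example = Tuple[str, str]         # (sentence, annotation label)
Annotator = Callable[[str], str]  # maps a sentence to an annotation label


def train_annotator(dataset: List[Example]) -> Annotator:
    """Toy stand-in for the M-step (hypothetical, not the paper's training).

    Memorizes seen sentences and predicts the majority label otherwise.
    """
    memory = dict(dataset)
    majority = Counter(label for _, label in dataset).most_common(1)[0][0]

    def annotate(sentence: str) -> str:
        return memory.get(sentence, majority)

    return annotate


def self_train(
    seed: List[Example], unlabeled_batches: List[List[str]]
) -> Tuple[Annotator, List[Example]]:
    """Alternate annotation (E-step) and retraining (M-step) over batches."""
    dataset = list(seed)
    annotator = train_annotator(dataset)
    for batch in unlabeled_batches:
        # E-step: label the new data with the current annotator.
        dataset.extend((sentence, annotator(sentence)) for sentence in batch)
        # M-step: retrain on the enlarged dataset; the result is used next round.
        annotator = train_annotator(dataset)
    return annotator, dataset


seed_data = [("a", "ok"), ("b", "ok"), ("c", "hallucination")]
final_annotator, final_dataset = self_train(seed_data, [["d"], ["e", "f"]])
print(len(final_dataset))
```

Each pass both enlarges the dataset and replaces the annotator, which is the coupling the abstract describes; a real pipeline would also filter or validate the new annotations before retraining.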

Paper Structure (21 sections, 4 equations, 9 figures, 11 tables)

Figures (9)

  • Figure 1: Our iterative self-training framework progressively scales up the hallucination annotation dataset size (left) and simultaneously increases the annotator's accuracy (right) in three stages.
  • Figure 1: The dataset size for ANAH-v2 in different stages, including the number of topics, model responses, and annotated sentences.
  • Figure 2: The schema of the EM-based iterative self-training framework. In the E-step, given unlabeled new data from the Data Growth Flywheel, the annotator predicts N candidate outputs $y$, and the representative annotation $y^*$ is chosen via self-consistency. This annotation process consists of three phases: Factual Existence Judgment, Reference Information Extraction, and Hallucination Type Judgment. Collecting the new annotations yields a larger dataset. In the M-step, an annotator is trained on this larger dataset, aligned to our training format, which yields a stronger annotator with higher accuracy.
  • Figure 3: The topic distribution chart by categories (inner) and domains (outer).
  • Figure A1: Prompts for factual existence judgment.
  • ...and 4 more figures
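
The Figure 2 caption says the representative annotation $y^*$ is chosen from N candidate outputs via self-consistency. One simple way to read that is a majority vote over the candidates, sketched below. This is an assumption for illustration: the function name `select_representative` and the label strings are hypothetical, and the paper's actual selection rule may be more elaborate.

```python
from collections import Counter
from typing import List


def select_representative(candidates: List[str]) -> str:
    """Return the most frequent candidate annotation (majority vote).

    Ties resolve to the candidate that appeared first, because Counter
    preserves insertion order among equal counts.
    """
    if not candidates:
        raise ValueError("at least one candidate annotation is required")
    return Counter(candidates).most_common(1)[0][0]


# Hypothetical labels for three sampled annotations of one sentence.
sampled = ["no_hallucination", "contradictory", "no_hallucination"]
representative = select_representative(sampled)
print(representative)
```

Voting over several samples trades extra inference cost for more stable labels, which matters when those labels become training data in the next iteration.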