Evaluating Evaluation Metrics -- The Mirage of Hallucination Detection

Atharva Kulkarni, Yuan Zhang, Joel Ruben Antony Moniz, Xiou Ge, Bo-Hsiang Tseng, Dhivya Piraviperumal, Swabha Swayamdipta, Hong Yu

TL;DR

This study interrogates the reliability and generalizability of hallucination-detection metrics for language models by conducting a large-scale empirical evaluation across 4 datasets, 37 models from 5 families, and 5 decoding methods. It finds pervasive misalignment between most metrics and human judgments, with limited cross-dataset consistency and no uniform gains from increasing model size. GPT-4-based evaluation emerges as the most reliable approach, and mode-seeking decoding along with ensemble metric scores offer practical improvements for reducing hallucinations in knowledge-grounded settings. The results emphasize the need for robust, generalizable evaluation metrics and caution against overreliance on any single metric for guiding mitigation strategies.

Abstract

Hallucinations pose a significant obstacle to the reliability and widespread adoption of language models, yet their accurate measurement remains a persistent challenge. While many task- and domain-specific metrics have been proposed to assess faithfulness and factuality concerns, the robustness and generalization of these metrics are still untested. In this paper, we conduct a large-scale empirical evaluation of 6 diverse sets of hallucination detection metrics across 4 datasets, 37 language models from 5 families, and 5 decoding methods. Our extensive investigation reveals concerning gaps in current hallucination evaluation: metrics often fail to align with human judgments, take an overtly myopic view of the problem, and show inconsistent gains with parameter scaling. Encouragingly, LLM-based evaluation, particularly with GPT-4, yields the best overall results, and mode-seeking decoding methods seem to reduce hallucinations, especially in knowledge-grounded settings. These findings underscore the need for more robust metrics to understand and quantify hallucinations, and better strategies to mitigate them.

Paper Structure

This paper contains 23 sections, 15 figures, and 11 tables.

Figures (15)

  • Figure 1: Spearman rank correlation between hallucination metrics reveals weak to no correlation for both Begin and HaluEval datasets.
  • Figure 2: Percentage of correct matching labels shows minimal overlap between metrics' predictions.
  • Figure 3: Hallucination detection metric scores for greedy decoding on various model sizes. Circles and hexagons represent pretrained and instruction-tuned models, respectively.
  • Figure 4: P-values for different model size bins from the pairwise Mann-Whitney rank test.
  • Figure 5: Metric accuracy across varying response lengths on the Begin and HaluEval datasets.
  • ...and 10 more figures
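
The inter-metric analysis named in the Figure 1 caption (Spearman rank correlation between hallucination metrics) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the metric names, and the toy scores below are all hypothetical and chosen only to show the computation. The sketch ranks each metric's scores (averaging ranks for ties) and then takes the Pearson correlation of the ranks, which is the standard definition of Spearman's coefficient.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y):
        raise ValueError("score lists must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


def pairwise_spearman(
    scores: Dict[str, Sequence[float]]
) -> Dict[Tuple[str, str], float]:
    """Spearman correlation for every unordered pair of metrics."""
    return {
        (a, b): spearman(scores[a], scores[b])
        for a, b in combinations(sorted(scores), 2)
    }


if __name__ == "__main__":
    # Hypothetical per-response scores from three made-up metrics;
    # these numbers are illustrative only and do not come from the paper.
    toy_scores = {
        "metric_a": [0.10, 0.40, 0.35, 0.80, 0.90],
        "metric_b": [0.20, 0.50, 0.30, 0.70, 0.95],
        "metric_c": [0.90, 0.10, 0.50, 0.30, 0.60],
    }
    for pair, rho in pairwise_spearman(toy_scores).items():
        print(pair, round(rho, 3))
```

A coefficient near 1 or -1 would indicate that two metrics rank responses almost identically or in reverse, while values near 0 correspond to the weak-to-no correlation the caption describes. In practice a library routine such as `scipy.stats.spearmanr` would normally replace the hand-written functions; the pure-Python version is used here only to keep the sketch dependency-free.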