Verify when Uncertain: Beyond Self-Consistency in Black Box Hallucination Detection

Yihao Xue, Kristjan Greenewald, Youssef Mroueh, Baharan Mirzasoleiman

TL;DR

The paper studies hallucination detection for black-box LLMs. It first shows empirically that self-consistency-based detectors already perform close to a supervised oracle, leaving little room for improvement within that paradigm. It then adds cross-model consistency checking with a verifier LLM, which raises the oracle ceiling, and proposes a budgeted two-stage detector that calls the verifier only when the self-consistency classifier falls inside an uncertainty interval. A kernel-mean-embedding view gives a geometric interpretation of both kinds of consistency check. Experiments across multiple datasets and model combinations show that the two-stage approach maintains high detection performance while significantly reducing computational cost.

Abstract

Large Language Models (LLMs) suffer from hallucination problems, which hinder their reliability in sensitive applications. In the black-box setting, several self-consistency-based techniques have been proposed for hallucination detection. We empirically study these techniques and show that they achieve performance close to that of a supervised (still black-box) oracle, suggesting little room for improvement within this paradigm. To address this limitation, we explore cross-model consistency checking between the target model and an additional verifier LLM. With this extra information, we observe improved oracle performance compared to purely self-consistency-based methods. We then propose a budget-friendly, two-stage detection algorithm that calls the verifier model only for a subset of cases. It dynamically switches between self-consistency and cross-consistency based on an uncertainty interval of the self-consistency classifier. We provide a geometric interpretation of consistency-based hallucination detection methods through the lens of kernel mean embeddings, offering deeper theoretical insights. Extensive experiments show that this approach maintains high detection performance while significantly reducing computational cost.
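
The two-stage switching rule described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the score is taken to be a mean pairwise distance (one minus the average consistency entry), and the names `two_stage_detect`, `cross_fn`, `t_low`, `t_high`, and `t_cross` are hypothetical. The point it shows is that the verifier is queried lazily, only when the self-consistency score lands inside the uncertainty interval.

```python
from typing import Callable, Tuple

import numpy as np


def mpd(P: np.ndarray) -> float:
    """Mean pairwise distance of a consistency matrix.

    Assumption: entries of P are consistency scores in [0, 1], and the
    score is 1 minus their average (higher means less consistent).
    """
    return float(1.0 - np.mean(P))


def two_stage_detect(
    P_self: np.ndarray,
    cross_fn: Callable[[], np.ndarray],
    t_low: float,
    t_high: float,
    t_cross: float,
) -> Tuple[bool, bool]:
    """Return (is_hallucination, verifier_was_called).

    cross_fn is a hypothetical callback that queries the verifier model
    and returns the cross-consistency matrix; it is only invoked for
    samples whose self-consistency score is ambiguous.
    """
    s_self = mpd(P_self)
    if s_self <= t_low:
        # Responses agree strongly: accept without calling the verifier.
        return False, False
    if s_self >= t_high:
        # Responses disagree strongly: flag without calling the verifier.
        return True, False
    # Ambiguous region: pay for the verifier and decide on cross-consistency.
    s_cross = mpd(cross_fn())
    return bool(s_cross > t_cross), True
```

Widening the interval between `t_low` and `t_high` sends more samples to the verifier, so in this sketch the two thresholds act as the knob that trades detection quality against verifier cost.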

Paper Structure

This paper contains 16 sections, 2 theorems, 16 equations, 9 figures, 1 table, 1 algorithm.

Key Result

Theorem 5.1

Suppose we are given $n_{neg}$ i.i.d. samples from the non-hallucinating distribution and $n_{pos}$ i.i.d. samples from the hallucinating distribution, and sets of candidate thresholds $\mathcal{T}_1 = \{t^1_j\}_{j=1}^{|\mathcal{T}_1|}$ and $\mathcal{T}_2 = \{t^2_k\}_{k=1}^{|\mathcal{T}_2|}$ for sta

Figures (9)

  • Figure 1: Two Stage Hallucination Detection. First, the self-consistency matrix ${\bm{P}}^{\text{self}}$ is formed and the test statistic is computed. This is thresholded with two thresholds, where medium values (gray region) advance to the second stage for disambiguation. The ${\bm{P}}^{\text{cross}}$ cross-consistency matrix and test statistic are then computed for these ambiguous samples for final classification.
  • Figure 2: Comparison between AUROC of existing methods and the approximated ceiling performance on SQuAD ((a)–(c)) and TriviaQA ((d)–(f)). We observe that, across all setups, the best method performs very close to the oracle, indicating that we are approaching the performance limit. A similar result is observed for AURAC in Fig. \ref{fig: methods_vs_gcn_rac}.
  • Figure 3: Comparison between approximated ceiling performances using only ${\bm{P}}^{\text{self}}$ (gray) and those using both ${\bm{P}}^{\text{self}}$ and ${\bm{P}}^{\text{cross}}$. The x-axis shows the target model, and the colors indicate the verifier model, as shown in the legend. We observe a clear improvement when a verifier model is used, in terms of both AUROC and AURAC.
  • Figure 4: A simple weighted average of self-consistency and cross-consistency-based metrics, $(1-\lambda) \text{MPD}({\bm{P}}^{\text{self}}) + \lambda \text{MPD}({\bm{P}}^{\text{cross}})$, can achieve performance close to that of the oracle method. Plots for AURAC are in Fig. \ref{fig: weighted_avg_aurac} in Appx. \ref{apdx: additional_exp}.
  • Figure 5: Geometric interpretation in the mean-embedding space of the "target" ($\mu_t$) and "verifier" ($\mu_v$) distributions. Self-consistency is measured via the norm of the target model's mean embedding, and cross-consistency via the dot product between the two mean embeddings. In stage one, detection is based on $\|\mu_t\|$: there is no hallucination outside the sphere of radius $\sqrt{1-t_1}$ (in green) and hallucination within the sphere of radius $\sqrt{1-t_*}$ (in red). Between the two spheres, the hyperplane defined by $\mu_v$ and $t_2$ splits the region into two zones: above it for no hallucination (dashed green) and below it for hallucination (dashed red).
  • ...and 4 more figures
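
The geometric reading in Figure 5 and the weighted average in Figure 4 can be checked numerically with a small sketch. It rests on assumptions that are not taken from the paper: responses are represented by unit-norm feature vectors, consistency is a plain inner product (a linear kernel standing in for whatever consistency measure is actually used, so entries may be negative here), averages include the diagonal, and MPD is one minus the mean entry. The names `phi_t`, `phi_v`, `unit_rows`, and `combined_score` are hypothetical. Under these assumptions the mean of the self-consistency matrix equals $\|\mu_t\|^2$ and the mean of the cross-consistency matrix equals $\langle \mu_t, \mu_v \rangle$.

```python
import numpy as np


def unit_rows(X: np.ndarray) -> np.ndarray:
    """Scale each row to unit Euclidean norm."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)


def mpd(P: np.ndarray) -> float:
    """Assumed definition: 1 minus the mean entry of a consistency matrix."""
    return float(1.0 - P.mean())


def combined_score(P_self: np.ndarray, P_cross: np.ndarray, lam: float) -> float:
    """Weighted average (1 - lam) * MPD(P_self) + lam * MPD(P_cross)."""
    return (1.0 - lam) * mpd(P_self) + lam * mpd(P_cross)


# Toy feature vectors for 5 target and 5 verifier responses (random stand-ins).
rng = np.random.default_rng(0)
phi_t = unit_rows(rng.normal(size=(5, 8)))
phi_v = unit_rows(rng.normal(size=(5, 8)))

# Consistency matrices under the linear-kernel assumption.
P_self = phi_t @ phi_t.T    # target vs. target
P_cross = phi_t @ phi_v.T   # target vs. verifier

# Empirical mean embeddings of the two response sets.
mu_t = phi_t.mean(axis=0)
mu_v = phi_v.mean(axis=0)

# Self-consistency is the squared norm of the target mean embedding,
# cross-consistency is the dot product of the two mean embeddings.
self_gap = abs(P_self.mean() - mu_t @ mu_t)
cross_gap = abs(P_cross.mean() - mu_t @ mu_v)
```

Both `self_gap` and `cross_gap` vanish up to floating-point error, which is why thresholding MPD on the self-consistency matrix corresponds to a sphere in embedding space and thresholding the cross-consistency matrix corresponds to a hyperplane with normal $\mu_v$.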

Theorems & Definitions (3)

  • Remark 4.1: Cross entailment
  • Theorem 5.1: AUROC Generalization
  • Proposition 2.1