Verify when Uncertain: Beyond Self-Consistency in Black Box Hallucination Detection
Yihao Xue, Kristjan Greenewald, Youssef Mroueh, Baharan Mirzasoleiman
TL;DR
The paper studies hallucination detection for black-box LLMs. It first shows empirically that self-consistency-based detectors already perform close to a supervised (still black-box) oracle, leaving little room for improvement within that paradigm. It then adds cross-model consistency checks against a verifier LLM, which raises the oracle ceiling, and proposes a budgeted, two-stage detector that calls the verifier only when the self-consistency classifier's score falls inside an uncertainty interval. A kernel mean embedding framework gives a geometric interpretation of consistency-based detection and informs the design. Across multiple datasets and model combinations, the two-stage approach maintains high detection performance while significantly reducing the number of verifier calls, and therefore the computational cost. The result is a principled, practical recipe for black-box hallucination detection under a limited verifier budget.
Abstract
Large Language Models (LLMs) suffer from hallucination problems, which hinder their reliability in sensitive applications. In the black-box setting, several self-consistency-based techniques have been proposed for hallucination detection. We empirically study these techniques and show that they achieve performance close to that of a supervised (still black-box) oracle, suggesting little room for improvement within this paradigm. To address this limitation, we explore cross-model consistency checking between the target model and an additional verifier LLM. With this extra information, we observe improved oracle performance compared to purely self-consistency-based methods. We then propose a budget-friendly, two-stage detection algorithm that calls the verifier model only for a subset of cases. It dynamically switches between self-consistency and cross-consistency based on an uncertainty interval of the self-consistency classifier. We provide a geometric interpretation of consistency-based hallucination detection methods through the lens of kernel mean embeddings, offering deeper theoretical insights. Extensive experiments show that this approach maintains high detection performance while significantly reducing computational cost.
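The two-stage switching rule described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the token-overlap similarity `jaccard`, the interval bounds `low` and `high`, the `cross_threshold`, and the `call_verifier` callback are all hypothetical stand-ins chosen to keep the example self-contained. A real system would likely use an NLI or embedding-based similarity and tune the interval to the available verifier budget.

```python
from itertools import combinations
from typing import Callable, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Toy similarity: token-set overlap (assumed stand-in for an NLI or embedding score)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def self_consistency(samples: List[str]) -> float:
    """Mean pairwise similarity among the target model's own sampled answers."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        return 1.0  # a single sample is trivially consistent with itself
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def cross_consistency(target_samples: List[str], verifier_samples: List[str]) -> float:
    """Mean similarity between target-model answers and verifier-model answers."""
    pairs = [(t, v) for t in target_samples for v in verifier_samples]
    if not pairs:
        raise ValueError("need at least one target and one verifier sample")
    return sum(jaccard(t, v) for t, v in pairs) / len(pairs)


def detect(
    target_samples: List[str],
    call_verifier: Callable[[], List[str]],
    low: float = 0.3,              # assumed lower edge of the uncertainty interval
    high: float = 0.7,             # assumed upper edge of the uncertainty interval
    cross_threshold: float = 0.5,  # assumed decision threshold for stage two
) -> Tuple[bool, bool]:
    """Return (is_hallucination, verifier_was_called)."""
    s = self_consistency(target_samples)
    # Stage 1: trust self-consistency when it is confidently low or high.
    if s < low:
        return True, False
    if s > high:
        return False, False
    # Stage 2: the score is uncertain, so spend budget on the verifier model.
    c = cross_consistency(target_samples, call_verifier())
    return c < cross_threshold, True
```

Widening the interval `[low, high]` sends more queries to the verifier (higher cost, more cross-model information), while narrowing it relies more on self-consistency alone. This is the cost-versus-performance trade-off the abstract refers to.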
