InterrogateLLM: Zero-Resource Hallucination Detection in LLM-Generated Answers
Yakir Yehuda, Itzik Malkiel, Oren Barkan, Jonathan Weill, Royi Ronen, Noam Koenigstein
TL;DR
The paper tackles hallucinations in LLM outputs by introducing InterrogateLLM, a backward-interrogation approach that detects inconsistencies between the original query and queries reconstructed from the model's answer. A forward prompt generates an answer. A backward process then reconstructs the query from that answer $K$ times, and the embedding-based similarity between the original and reconstructed queries is compared against a threshold to flag hallucinations. The method needs no external knowledge and adapts to few-shot settings. Across the Movies, Books, and GCI datasets, with GPT-3 and Llama-2 variants, InterrogateLLM (especially with model ensembles and higher $K$) outperforms baselines such as SelfCheckGPT and SBERT/ADA cosine-similarity detectors. Ablations show the value of multiple backward passes and temperature variation. The findings suggest a scalable, prompt-based mechanism for improving LLM reliability in real-world applications. Future work includes extending the method to retrieval-augmented generation and addressing limitations such as many-to-one mappings and semi-truth detection.
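The forward/backward loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The LLMs and the embedder are modeled as plain Python callables (`interrogate`, `forward`, `backward`, `embed`, and `threshold` are hypothetical names). The detection score is assumed to be the mean cosine similarity between the original query and every reconstructed query, and a query is flagged when that mean falls below `threshold`. The toy bag-of-words embedder and stub models are invented only so the sketch runs offline. In practice the backward calls would be sampled LLM completions (for example at varied temperatures) and the embedder a sentence-embedding model such as SBERT or ADA.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical aliases: an LLM is modeled as prompt -> text, an embedder as text -> vector.
LLM = Callable[[str], str]
Embedder = Callable[[str], List[float]]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def interrogate(query: str, forward: LLM, backward: List[LLM], embed: Embedder,
                k: int, threshold: float) -> Tuple[bool, float]:
    """Return (is_hallucination, score) for one query under an assumed scoring rule."""
    answer = forward(query)                  # forward pass: query -> answer
    query_vec = embed(query)
    sims = []
    for model in backward:                   # one or more backward models (ensemble)
        for _ in range(k):                   # K backward reconstructions per model
            reconstructed = model(answer)    # backward pass: answer -> query
            sims.append(cosine(query_vec, embed(reconstructed)))
    score = sum(sims) / len(sims)            # assumed: mean similarity over all passes
    return score < threshold, score


# --- Toy stand-ins so the sketch runs offline (all invented for illustration) ---
VOCAB = ["inception", "2010", "directed", "movie", "heist", "titanic", "1997"]


def toy_embed(text: str) -> List[float]:
    """Bag-of-words counts over a fixed vocabulary."""
    counts = Counter(text.lower().replace("?", " ").replace(",", " ").split())
    return [float(counts[w]) for w in VOCAB]


query = "Who directed the 2010 heist movie Inception?"
forward = lambda q: "Christopher Nolan"
faithful = lambda a: "Who directed the 2010 heist movie Inception?"  # recovers the query
drifting = lambda a: "Who directed Titanic in 1997?"                 # drifts away from it

print(interrogate(query, forward, [faithful], toy_embed, k=3, threshold=0.8))  # not flagged
print(interrogate(query, forward, [drifting], toy_embed, k=3, threshold=0.8))  # flagged
```

Passing several backward models, each contributing `k` reconstructions, also covers the ensemble variant; averaging over all reconstructions is one assumed way to combine them.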
Abstract
Despite the many advances of Large Language Models (LLMs) and their unprecedentedly rapid evolution, their impact and integration into every facet of our daily lives remain limited for various reasons. One critical factor hindering their widespread adoption is the occurrence of hallucinations, where LLMs invent answers that sound realistic yet drift away from factual truth. In this paper, we present a novel method for detecting hallucinations in large language models, addressing a critical barrier to adopting these models in real-world scenarios. Through extensive evaluations across multiple datasets and LLMs, including Llama-2, we study the hallucination levels of various recent LLMs and demonstrate the effectiveness of our method in automatically detecting them. Notably, we observe up to 87% hallucinations for Llama-2 in a specific experiment, where our method achieves a Balanced Accuracy of 81%, all without relying on external knowledge.
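Balanced Accuracy is a sensible metric when classes are heavily skewed, as in the experiment with up to 87% hallucinations. It averages the recall on each class, $\mathrm{BA} = \tfrac{1}{2}(\mathrm{TPR} + \mathrm{TNR})$, so a trivial detector that flags every answer scores 0.5 regardless of the class ratio. The sketch below contrasts it with plain accuracy; the labels are an invented 87/13 split for illustration, not the paper's data, and the function names are hypothetical.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean per-class recall; label 1 = hallucination, 0 = correct answer."""
    pairs = list(zip(y_true, y_pred))
    pos = [p for t, p in pairs if t == 1]
    neg = [p for t, p in pairs if t == 0]
    if not pos or not neg:
        raise ValueError("balanced accuracy needs both classes in y_true")
    tpr = sum(p == 1 for p in pos) / len(pos)  # recall on hallucinations
    tnr = sum(p == 0 for p in neg) / len(neg)  # recall on correct answers
    return (tpr + tnr) / 2


# Hypothetical split: 87 hallucinated answers, 13 correct ones.
y_true = [1] * 87 + [0] * 13
always_flag = [1] * 100  # trivial detector that flags everything

print(accuracy(y_true, always_flag))           # 0.87
print(balanced_accuracy(y_true, always_flag))  # 0.5
```

Plain accuracy rewards the trivial detector for the class imbalance, while Balanced Accuracy exposes that it never recognizes a correct answer.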
