Steer LLM Latents for Hallucination Detection
Seongheon Park, Xuefeng Du, Min-Hsuan Yeh, Haobo Wang, Yixuan Li
TL;DR
This work addresses hallucination detection in LLMs with the Truthfulness Separator Vector (TSV), a lightweight steering vector that is applied to intermediate-layer representations at inference time to better separate truthful from hallucinated outputs, without fine-tuning model parameters. TSV is trained in two stages: (1) a small set of labeled exemplars is used to form tight class prototypes under a hyperspherical embedding model, and (2) the exemplar set is augmented with unlabeled generations through optimal transport-based pseudo-labeling and confidence-based sample selection. TSV achieves state-of-the-art AUROC on multiple QA datasets with only tens of labeled examples and approaches fully supervised upper bounds while avoiding extensive labeling and fine-tuning. The approach generalizes well across datasets and models and keeps inference efficient, making it a practical, scalable option for reliable LLM deployment in safety-critical scenarios. Possible future extensions include token-level hallucination localization and long-form QA.
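The steering and prototype-scoring idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `steer` and `truthfulness_score`, the parameters `strength` and `kappa`, and the convention that prototype row 0 is the truthful class are all assumptions made for illustration. Plain arrays stand in for LLM hidden states, and the transformer layers between the steered layer and the readout are omitted.

```python
import numpy as np

def steer(hidden, tsv, strength=1.0):
    """Add one shared steering vector to every hidden state.

    hidden: (n, d) array standing in for intermediate-layer activations.
    tsv:    (d,) steering vector; the model weights themselves are untouched.
    strength is a hypothetical scaling knob, not taken from the paper.
    """
    return hidden + strength * tsv

def truthfulness_score(embeddings, prototypes, kappa=10.0):
    """Probability of the truthful class from cosine similarity to prototypes.

    prototypes: (2, d); row 0 = truthful, row 1 = hallucinated (assumed order).
    kappa: assumed concentration (inverse temperature) on the unit sphere.
    """
    # Project embeddings and prototypes onto the unit hypersphere.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    mu = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = kappa * (z @ mu.T)
    # Numerically stable softmax over the two classes.
    logits = logits - logits.max(axis=1, keepdims=True)
    exp_logits = np.exp(logits)
    probs = exp_logits / exp_logits.sum(axis=1, keepdims=True)
    return probs[:, 0]
```

In a real model the vector would be added inside the forward pass (for example through a hook on one layer) and the score would be read from a later representation; only the vector would be trained.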
Abstract
Hallucinations in LLMs pose a significant concern to their safe deployment in real-world applications. Recent approaches have leveraged the latent space of LLMs for hallucination detection, but their embeddings, optimized for linguistic coherence rather than factual accuracy, often fail to clearly separate truthful and hallucinated content. To this end, we propose the Truthfulness Separator Vector (TSV), a lightweight and flexible steering vector that reshapes the LLM's representation space during inference to enhance the separation between truthful and hallucinated outputs, without altering model parameters. Our two-stage framework first trains TSV on a small set of labeled exemplars to form compact and well-separated clusters. It then augments the exemplar set with unlabeled LLM generations, employing an optimal transport-based algorithm for pseudo-labeling combined with a confidence-based filtering process. Extensive experiments demonstrate that TSV achieves state-of-the-art performance with minimal labeled data, exhibiting strong generalization across datasets and providing a practical solution for real-world LLM applications.
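The second stage can be sketched in the same spirit. The abstract does not specify the optimal transport algorithm, so the example below assumes a standard entropy-regularized formulation solved with Sinkhorn iterations, followed by a simple top-k confidence filter. The function names, `eps`, `n_iters`, `class_prior`, and `k` are hypothetical, and the input `logits` stand in for similarities between unlabeled generations and the two class prototypes.

```python
import numpy as np

def sinkhorn_pseudo_labels(logits, class_prior, eps=0.1, n_iters=200):
    """Soft pseudo-labels via entropy-regularized optimal transport.

    logits:      (n, c) similarity of each unlabeled sample to each class.
    class_prior: (c,) assumed target class proportions, summing to 1.
    Returns an (n, c) array whose rows sum to 1 and whose column means
    approximately match class_prior.
    """
    class_prior = np.asarray(class_prior, dtype=float)
    n = logits.shape[0]
    # Subtract the global max for numerical stability before exponentiating.
    q = np.exp((logits - logits.max()) / eps)
    q /= q.sum()
    row_marginal = np.full(n, 1.0 / n)
    for _ in range(n_iters):
        q *= (class_prior / q.sum(axis=0))[None, :]   # match class proportions
        q *= (row_marginal / q.sum(axis=1))[:, None]  # equal mass per sample
    return q * n

def select_confident(soft_labels, k):
    """Keep the k samples whose pseudo-label is most confident."""
    confidence = soft_labels.max(axis=1)
    keep = np.argsort(-confidence)[:k]
    return keep, soft_labels[keep].argmax(axis=1)
```

The selected samples and their hard labels would then be added to the exemplar set for further training. The class-proportion constraint is what distinguishes this from a plain softmax: it prevents all unlabeled samples from collapsing onto one class.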
