Steer LLM Latents for Hallucination Detection

Seongheon Park, Xuefeng Du, Min-Hsuan Yeh, Haobo Wang, Yixuan Li

TL;DR

This work tackles the challenge of hallucination detection in LLMs by introducing the Truthfulness Separator Vector (TSV), a lightweight, inference-time vector that steers intermediate-layer representations to better separate truthful and hallucinated outputs without fine-tuning model parameters. It presents a two-stage training framework: (1) a small labeled exemplar stage to form tight class prototypes with a hyperspherical embedding model, and (2) augmentation with unlabeled generations via an optimal transport–based pseudo-labeling pipeline and confidence-based sample selection. TSV achieves state-of-the-art AUROC on multiple QA datasets with only tens of labeled examples, and it approaches fully supervised upper bounds while avoiding extensive labeling and fine-tuning. The approach demonstrates strong generalization across datasets and models, offering a practical, scalable solution for reliable LLM deployment in safety-critical scenarios, with efficient inference and broad applicability to future extensions such as token-level hallucination localization and long-form QA.
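The steering and prototype ideas can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the temperature `kappa`, and the convention that prototype row 1 is the truthful class are all assumptions made for illustration. In the method described here, the vector is added at an intermediate layer and the embedding is read out after the remaining layers. The toy below skips the LLM and scores the steered vector directly.

```python
import numpy as np

def steer(hidden, tsv, strength=1.0):
    """Add a scaled steering vector to each hidden state (rows of `hidden`).

    Model weights are untouched; only the activations are shifted.
    `strength` plays the role of the steering strength lambda.
    """
    return hidden + strength * tsv

def l2_normalize(x, eps=1e-12):
    """Project rows onto the unit hypersphere."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def truthfulness_score(embeddings, prototypes, kappa=10.0):
    """Return P(truthful) from scaled cosine similarity to class prototypes.

    Assumed layout: prototypes[0] is the hallucinated class and
    prototypes[1] is the truthful class. `kappa` is a hypothetical
    temperature controlling how sharp the softmax is.
    """
    z = l2_normalize(embeddings)
    mu = l2_normalize(prototypes)
    logits = kappa * (z @ mu.T)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs[:, 1]
```

A detector in this style would threshold the returned score; choosing the threshold and learning `tsv` and the prototypes are outside the scope of this sketch.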

Abstract

Hallucinations in LLMs pose a significant concern to their safe deployment in real-world applications. Recent approaches have leveraged the latent space of LLMs for hallucination detection, but their embeddings, optimized for linguistic coherence rather than factual accuracy, often fail to clearly separate truthful and hallucinated content. To this end, we propose the Truthfulness Separator Vector (TSV), a lightweight and flexible steering vector that reshapes the LLM's representation space during inference to enhance the separation between truthful and hallucinated outputs, without altering model parameters. Our two-stage framework first trains TSV on a small set of labeled exemplars to form compact and well-separated clusters. It then augments the exemplar set with unlabeled LLM generations, employing an optimal transport-based algorithm for pseudo-labeling combined with a confidence-based filtering process. Extensive experiments demonstrate that TSV achieves state-of-the-art performance with minimal labeled data, exhibiting strong generalization across datasets and providing a practical solution for real-world LLM applications.
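The pseudo-labeling stage can be sketched in the same hedged spirit. The abstract names an optimal transport-based algorithm and a confidence-based filter but gives no details, so the code below substitutes a generic Sinkhorn-style balancing of predicted probabilities and a simple top-k confidence rule. The function names, `epsilon`, `n_iters`, `class_marginal`, and `top_k` are assumptions, not values from the paper.

```python
import numpy as np

def sinkhorn_pseudo_labels(probs, class_marginal, epsilon=0.1, n_iters=100):
    """Turn model probabilities (n, k) into balanced soft pseudo-labels.

    Alternately rescales columns and rows so that each sample carries
    mass 1/n and each class receives the mass given by `class_marginal`.
    The final step is a row rescaling, so returned rows sum to 1.
    """
    n, _ = probs.shape
    Q = np.power(probs + 1e-12, 1.0 / epsilon)  # sharpen predictions
    Q = Q / Q.sum()
    row_target = np.full(n, 1.0 / n)
    col_target = np.asarray(class_marginal, dtype=float)
    for _ in range(n_iters):
        Q = Q * (col_target / Q.sum(axis=0))[None, :]  # match class mass
        Q = Q * (row_target / Q.sum(axis=1))[:, None]  # match sample mass
    return Q * n

def select_confident(soft_labels, top_k):
    """Keep the top_k samples whose largest soft-label entry is highest.

    Returns the selected indices and their hard (argmax) labels.
    """
    confidence = soft_labels.max(axis=1)
    idx = np.argsort(-confidence)[:top_k]
    return idx, soft_labels[idx].argmax(axis=1)
```

The selected samples and their labels would then be appended to the exemplar set before retraining. The class marginal has to be supplied by the user in this sketch.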

Paper Structure

This paper contains 46 sections, 16 equations, 8 figures, 12 tables, 2 algorithms.

Figures (8)

  • Figure 1: t-SNE visualization (van der Maaten & Hinton, 2008) of the last-token embeddings from the final layer of LLaMA-3.1-8B on the TruthfulQA test set. (a) The pre-trained model's embeddings overlap substantially, whereas (b) adding TSV to the latent states of an intermediate LLM layer effectively separates the embeddings of truthful and hallucinated data.
  • Figure 2: Overall framework. In the initial training phase, Truthfulness Separator Vector (TSV) is trained on an exemplar set. After initial training, (1) we assign soft pseudo-labels to the unlabeled data, (2) select confident pseudo-labeled samples, and (3) augment the exemplar set with selected samples. Finally, we retrain TSV with the augmented set. Best viewed in color.
  • Figure 3: (a) Effect of steering location (layer index and MHA components) on TruthfulQA performance, (b) effect of steering strength $\lambda$, and (c) effect of the number of labeled exemplars. All results are reported as AUROC using LLaMA-3.1-8B.
  • Figure 4: Generalization results on out-of-distribution datasets.
  • Figure 5: Score distributions for HaloScope vs. our method.
  • ...and 3 more figures

Theorems & Definitions (2)

  • Definition 3.1: Hallucination detector
  • Definition 3.2: Unlabeled data