HARP: Hallucination Detection via Reasoning Subspace Projection

Junjie Hu, Gang Tu, ShengYu Cheng, Jinxin Li, Jinting Wang, Rui Chen, Zhilong Zhou, Dongbo Shan

TL;DR

This work introduces HARP, a framework that detects hallucinations in large language models by explicitly separating hidden-state representations into semantic and reasoning subspaces. By applying SVD to the Unembedding layer and using a low-rank approximation, HARP identifies basis vectors for both subspaces and projects hidden states onto the reasoning subspace to form compact, robust features for hallucination detection. The detector, trained with beam-generated supervision, achieves state-of-the-art AUROC across multiple QA datasets and models, while maintaining strong robustness under distribution shifts. The results validate the direct-sum decomposition of hidden states and demonstrate the causal role of the reasoning subspace in generation, with potential for future reasoning-aware hallucination mitigation.
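The decomposition in the TL;DR can be written out in a few lines. This worked form is a sketch under stated assumptions and is not the paper's own notation. It assumes that $W_{unemb} \in \mathbb{R}^{|\mathcal{V}| \times d}$ maps a hidden state $h_l \in \mathbb{R}^d$ to vocabulary logits. It also assumes that the reasoning subspace is spanned by the $r$ right singular vectors with the smallest singular values, and that the semantic subspace is spanned by the remaining $d-r$. The symbols $V_s$, $V_r$, $\Sigma_r = \operatorname{diag}(\sigma_{d-r+1}, \dots, \sigma_d)$, and $z_l$ are introduced here for illustration only.

```latex
% Assumption: W_unemb maps h_l in R^d to |\mathcal{V}| logits; reasoning = trailing singular directions.
W_{unemb} = U \Sigma V^{\top}, \quad \Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_d), \quad \sigma_1 \ge \dots \ge \sigma_d \\
V = [\, V_s \mid V_r \,], \quad V_s \in \mathbb{R}^{d \times (d-r)}, \quad V_r \in \mathbb{R}^{d \times r} \\
h_l = \underbrace{V_s V_s^{\top} h_l}_{h_{l,\text{Semantic}}} + \underbrace{V_r V_r^{\top} h_l}_{h_{l,\text{Reasoning}}}, \quad \mathbb{R}^d = \operatorname{span}(V_s) \oplus \operatorname{span}(V_r) \\
\lVert W_{unemb}\, h_{l,\text{Reasoning}} \rVert_2 = \lVert \Sigma_r V_r^{\top} h_l \rVert_2 \le \sigma_{d-r+1} \, \lVert V_r^{\top} h_l \rVert_2 \\
z_l = V_r^{\top} h_l \in \mathbb{R}^{r}, \quad r \approx 0.05\, d
```

$V$ is orthogonal, so the two projections always sum back to $h_l$, which is the direct-sum property. The inequality shows the effect of the assumption above: the reasoning component can shift the logits only by an amount scaled by the small trailing singular values. The compact feature $z_l$ is what a detector would consume.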

Abstract

Hallucinations in Large Language Models (LLMs) pose a major barrier to their reliable use in critical decision-making. Although existing hallucination detection methods have improved accuracy, they still struggle to disentangle semantic and reasoning information and to remain robust. To address these challenges, we propose HARP (Hallucination detection via reasoning subspace projection), a novel hallucination detection framework. HARP establishes that the hidden state space of LLMs can be decomposed into a direct sum of a semantic subspace and a reasoning subspace, where the former encodes linguistic expression and the latter captures internal reasoning processes. Moreover, we demonstrate that the Unembedding layer can disentangle these subspaces: by applying Singular Value Decomposition (SVD) to its parameters, we obtain the basis vectors spanning the semantic and reasoning subspaces. Finally, HARP projects hidden states onto the basis vectors of the reasoning subspace and uses the resulting projections as input features for hallucination detection. With these projections, HARP reduces the feature dimension to approximately 5% of the original, filters out most noise, and improves robustness. Experiments across multiple datasets show that HARP achieves state-of-the-art hallucination detection performance; in particular, it achieves an AUROC of 92.8% on TriviaQA, outperforming the previous best method by 7.5%.
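The SVD-and-projection steps described in the abstract can be sketched in NumPy. This is a minimal illustration and not the authors' implementation. It assumes that the reasoning subspace is spanned by the right singular vectors of $W_{unemb}$ with the smallest singular values, and that its dimension is about 5% of the hidden size $d$. The names `split_subspaces`, `reasoning_features`, and `reasoning_ratio` are hypothetical, and a random matrix stands in for a real unembedding matrix.

```python
import numpy as np


def split_subspaces(w_unemb, reasoning_ratio=0.05):
    # Thin SVD: w_unemb (vocab x d) = U @ diag(s) @ vt, with s sorted in descending order.
    _, s, vt = np.linalg.svd(w_unemb, full_matrices=False)
    d = vt.shape[1]
    # Assumption: the trailing r directions (smallest singular values) span the reasoning subspace.
    r = max(1, round(reasoning_ratio * d))
    v = vt.T  # columns are right singular vectors, an orthonormal basis of R^d
    v_sem, v_reason = v[:, : d - r], v[:, d - r :]
    return v_sem, v_reason, s


def reasoning_features(h, v_reason):
    # Coordinates of hidden state(s) h with shape (..., d) in the reasoning basis -> (..., r).
    return h @ v_reason


rng = np.random.default_rng(0)
w_unemb = rng.standard_normal((500, 64))  # toy stand-in: vocab = 500, d = 64
v_sem, v_reason, s = split_subspaces(w_unemb)
h = rng.standard_normal(64)

h_sem = v_sem @ (v_sem.T @ h)           # projection onto the semantic subspace
h_reason = v_reason @ (v_reason.T @ h)  # projection onto the reasoning subspace
z = reasoning_features(h, v_reason)     # compact detector input (r = 3 here)

print(v_reason.shape)                    # (64, 3)
print(np.allclose(h_sem + h_reason, h))  # True: the two parts reconstruct h
```

Figure 4 reports hidden sizes of 3584 for Qwen-2.5-7B-Instruct and 4096 for LLaMA-3.1-8B. With the same rounding, a 5% ratio would give r = 179 and r = 205 respectively. The paper's actual subspace dimension may be chosen differently, for example by the sweep shown in Figure 5(b).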

Paper Structure

This paper contains 29 sections, 28 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: Comparison of the "Reasoning $\rightarrow$ Expression" behavior between humans and LLMs
  • Figure 2: Illustration of the proposed HARP framework for hallucination detection. HARP separates the reasoning information $h_{l,\text{Reasoning}}$ from the hidden state $h_l$ to compute token-level hallucination scores, with the maximum score taken as the hallucination score of the entire response.
  • Figure 3: Flow of semantic and reasoning information within LLM hidden states.
  • Figure 4: (a) Singular value distributions of $W_{unemb}$ after SVD, with hidden state dimensions of 3584 for Qwen-2.5-7B-Instruct and 4096 for LLaMA-3.1-8B. (b) Projections of hidden states onto the basis vectors of the semantic and reasoning subspaces across layers, where the first row shows the first three layers and the second row shows the last three layers. Further details are provided in the section on the universal representation of hidden states.
  • Figure 5: (a) Greedy token rankings in $logits^{\prime}$ under different reasoning subspace dimensions. (b) Effect of reasoning subspace dimension on hallucination detection performance.
  • ...and 9 more figures
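The aggregation rule in Figure 2 takes the maximum of the token-level hallucination scores as the response score. It can be sketched as follows. This summary does not describe the detector's architecture, so a logistic (linear) probe with hand-picked weights stands in for HARP's trained detector. The functions `token_scores` and `response_score`, and all numbers below, are hypothetical.

```python
import numpy as np


def token_scores(features, weight, bias):
    # Hypothetical linear probe on reasoning-subspace features:
    # features (T, r) -> per-token hallucination probability (T,).
    logits = features @ weight + bias
    return 1.0 / (1.0 + np.exp(-logits))


def response_score(features, weight, bias):
    # As in Figure 2, the response-level score is the maximum token-level score.
    return float(np.max(token_scores(features, weight, bias)))


# Toy example: 3 response tokens, each with r = 2 projected features (illustrative numbers).
features = np.array([[0.0, 0.0], [2.0, -1.0], [-1.0, 0.5]])
weight = np.array([1.0, -1.0])
bias = 0.0

scores = token_scores(features, weight, bias)  # token logits are 0.0, 3.0, -1.5
score = response_score(features, weight, bias)
print(int(np.argmax(scores)))  # 1: the second token drives the response score
print(round(score, 4))         # 0.9526
```

Taking the maximum means a single strongly suspicious token is enough to flag the whole response. Averaging would dilute that token among many confident ones.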