The Geometry of Truth: Layer-wise Semantic Dynamics for Hallucination Detection in Large Language Models

Amir Hameed Mir

TL;DR

This paper introduces Layer-wise Semantic Dynamics (LSD), a geometry-based framework that detects hallucinations in large language models by analyzing the evolution of hidden-state semantics across transformer layers. LSD aligns layerwise representations with ground-truth embeddings through margin-based contrastive learning and computes trajectory-based metrics (alignment, velocity, acceleration) to distinguish factually correct outputs from hallucinations. Empirical results on TruthfulQA and synthetic data show strong separability (F1 ≈ 0.92, AUROC ≈ 0.96, clustering ≈ 0.89) with a single forward pass, outperforming sampling-based baselines while enabling real-time monitoring and interpretability. The work provides both practical detection capabilities and deeper insights into how factual grounding manifests as convergent trajectories in semantic space, offering a principled path toward robust, scalable truth verification in LLMs.

Abstract

Large Language Models (LLMs) often produce fluent yet factually incorrect statements, a phenomenon known as hallucination, which poses serious risks in high-stakes domains. We present Layer-wise Semantic Dynamics (LSD), a geometric framework for hallucination detection that analyzes the evolution of hidden-state semantics across transformer layers. Unlike prior methods that rely on multiple sampling passes or external verification sources, LSD operates intrinsically within the model's representational space. Using margin-based contrastive learning, LSD aligns hidden activations with ground-truth embeddings derived from a factual encoder, revealing a distinct separation in semantic trajectories: factual responses preserve stable alignment, while hallucinations exhibit pronounced semantic drift across depth. Evaluated on the TruthfulQA and synthetic factual-hallucination datasets, LSD achieves an F1-score of 0.92, AUROC of 0.96, and clustering accuracy of 0.89, outperforming SelfCheckGPT and Semantic Entropy baselines while requiring only a single forward pass. This efficiency yields a 5-20x speedup over sampling-based methods without sacrificing precision or interpretability. LSD offers a scalable, model-agnostic mechanism for real-time hallucination monitoring and provides new insights into the geometry of factual consistency within large language models.
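To make the trajectory idea concrete, here is a minimal sketch of per-layer alignment, velocity, and acceleration for a single sample. The definitions are assumptions chosen for illustration and may differ from the paper's: alignment is taken as cosine similarity to the ground-truth embedding, velocity as the Euclidean distance between consecutive layer states, and acceleration as the change in velocity. The names `cosine` and `trajectory_metrics` are hypothetical, and the hidden states are assumed to be already projected into the same space as the ground-truth embedding.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def trajectory_metrics(layer_states, truth):
    """Return (alignment, velocity, acceleration) lists for one sample.

    layer_states: one projected hidden-state vector per layer.
    truth: ground-truth embedding in the same space.

    Assumed definitions (illustrative, not taken from the paper):
      alignment[l]    = cos(h_l, truth)
      velocity[l]     = ||h_{l+1} - h_l||
      acceleration[l] = velocity[l+1] - velocity[l]
    """
    alignment = [cosine(h, truth) for h in layer_states]
    velocity = [math.dist(a, b) for a, b in zip(layer_states, layer_states[1:])]
    acceleration = [v2 - v1 for v1, v2 in zip(velocity, velocity[1:])]
    return alignment, velocity, acceleration


# Toy 2-D trajectory that rotates toward the truth direction over three layers.
states = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
truth = [0.0, 1.0]
alignment, velocity, acceleration = trajectory_metrics(states, truth)
```

In this toy case the alignment rises monotonically across layers, which is the qualitative pattern the abstract associates with factual responses; a drifting trajectory would show alignment falling or fluctuating instead.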

Paper Structure

This paper contains 61 sections, 1 theorem, 19 equations, 9 figures, 5 tables.

Key Result

Theorem 1

Let $M_f$ and $M_h$ denote the distributions of a trajectory metric $M$ for factual and hallucinated samples, respectively. If $M_f$ and $M_h$ are approximately normal with means $\mu_f, \mu_h$ and pooled standard deviation $s_p$, then the effect size

$$d = \frac{\mu_f - \mu_h}{s_p}$$

quantifies the standardized separation between the two distributions, with $|d| > 0.8$ indicating a large effect.
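The effect size in Theorem 1 matches the usual form of Cohen's d (the difference in means divided by the pooled standard deviation), so it can be sketched in a few lines. The pooled standard deviation below uses the standard sample-variance formula, which is an assumption, since this summary does not spell out how $s_p$ is computed. The function name `cohens_d` and the toy scores are hypothetical.

```python
import math
import statistics


def cohens_d(factual, hallucinated):
    """Standardized mean difference between two groups of metric values.

    Assumes the conventional pooled standard deviation:
      s_p = sqrt(((n_f - 1) * var_f + (n_h - 1) * var_h) / (n_f + n_h - 2))
    where var_* are sample variances (n - 1 in the denominator).
    """
    n_f, n_h = len(factual), len(hallucinated)
    var_f = statistics.variance(factual)
    var_h = statistics.variance(hallucinated)
    s_p = math.sqrt(((n_f - 1) * var_f + (n_h - 1) * var_h) / (n_f + n_h - 2))
    return (statistics.mean(factual) - statistics.mean(hallucinated)) / s_p


# Toy final-alignment scores, invented for illustration only.
factual_scores = [0.8, 0.9, 1.0]
hallucinated_scores = [0.1, 0.2, 0.3]
d = cohens_d(factual_scores, hallucinated_scores)
is_large_effect = abs(d) > 0.8  # threshold stated in Theorem 1
```

The sign of d depends only on which group is listed first; the theorem's threshold is stated on $|d|$, so the ordering does not affect whether an effect counts as large.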

Figures (9)

  • Figure 1: Layer-wise Semantic Convergence (Factual vs. Hallucinated Content). The plot illustrates the average semantic trajectory of factual (solid green line) and hallucinated (dashed red line) samples, measured by Alignment with Truth (cosine similarity) across the model's layers. The factual trajectory maintains significantly higher alignment than the hallucination trajectory across all 13 layers (marked by blue stars, $p < 0.05$). Factual content exhibits a consistently higher average similarity with the ground truth, stabilizing around an alignment of approximately $-0.15$ in the middle layers before a slight increase toward the final layer. Hallucinated content settles at a lower alignment (approximately $-0.20$), confirming that the divergence leading to untruthful output is established early and persists with a consistently lower alignment profile throughout the network.
  • Figure 2: Comprehensive evaluation of LSD-based hallucination detection performance. The top-left panel compares core classification metrics (Precision, Recall, F1-Score, and AUROC) across models. The top-right panel presents ROC curves, where the LSD_LogisticRegression classifier achieves the highest discriminative performance (AUC = 0.959), indicating strong separability between factual and hallucinated trajectories. The bottom-left panel illustrates composite detection scores, confirming that the logistic regression model attains the highest aggregate score (0.920) among tested methods. Finally, the bottom-right panel shows cross-validation stability, where narrow error bars demonstrate consistent generalization performance across folds. Together, these results confirm that LSD representations are both geometrically interpretable and highly discriminative, achieving robust factual–hallucination separation under multiple evaluation regimes.
  • Figure 3: Comprehensive layer-wise semantic dynamics analysis. (Top-left) Semantic alignment trajectories: factual samples (green) show progressively increasing alignment with the truth manifold, while hallucinations (red) diverge and remain unstable across layers. (Top-right) Layer-wise significance: all layers exhibit extremely low $p$-values ($<10^{-40}$), confirming statistically reliable separation of alignment dynamics. (Bottom-left) Effect size distribution: large and consistent Cohen’s $d$ values ($\approx 3.0$) indicate strong discriminative power between factual and hallucinatory trajectories. (Bottom-right) Final alignment distribution: factual samples cluster near $+1.0$ (high truth alignment), while hallucinations concentrate around $-0.5$, forming a clear bimodal separation in representational space.
  • Figure 4: Violin plots of layer-wise semantic dynamics metrics comparing factual and hallucinated samples. The top panels show distributions of Mean Velocity and Mean Acceleration, which exhibit near-identical behavior across both sample types, suggesting that basic representational dynamics alone are insufficient to distinguish factuality. In contrast, the bottom-left panel (Alignment Gain, Final–Initial) demonstrates a clear separation: factual trajectories show strong positive gains (approximately $+0.4$), indicating convergence toward the truth manifold, while hallucinated trajectories exhibit negative drift. The bottom-right panel (Convergence Layer) shows that factual content reaches peak alignment in deeper layers (around layer 12), whereas hallucinations plateau prematurely in shallower layers, highlighting a fundamental difference in how truthful and untruthful representations evolve across the network.
  • Figure 5: Semantic geometry in the alignment–consistency plane. Scatter plot showing the distribution of generated samples in the Final Alignment versus Mean Direction Consistency plane. Each point represents a single output sequence from the model. Factual content (green) is tightly clustered in the high-alignment region ($\approx 0.7$ to $1.0$), forming a dense, well-defined manifold. Hallucinations (red) are widely scattered across the low-alignment and negative-alignment regions, exhibiting high geometric instability. The clear, statistically significant separation along the Final Alignment axis ($p < 0.000001$) suggests that truthful content occupies a distinct, high-fidelity subspace of the model's semantic geometry.
  • ...and 4 more figures
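
Two of the summary metrics named in the Figure 4 caption can be read directly off a per-layer alignment curve. The sketch below takes Alignment Gain as final minus initial alignment, as the caption labels it, and assumes that Convergence Layer means the first layer at which alignment peaks; that second definition is a guess from the caption's wording, not a confirmed formula. The function names and the toy curves are hypothetical.

```python
def alignment_gain(alignment):
    """Final-layer alignment minus first-layer alignment."""
    return alignment[-1] - alignment[0]


def convergence_layer(alignment):
    """Index of the first layer where alignment reaches its maximum.

    Assumed definition: the caption only says factual content
    'reaches peak alignment' in deeper layers.
    """
    return max(range(len(alignment)), key=lambda layer: alignment[layer])


# Toy alignment curves, invented for illustration only.
factual_curve = [0.1, 0.3, 0.5]        # keeps rising with depth
hallucinated_curve = [0.2, 0.25, 0.1]  # peaks early, then drifts down

factual_summary = (alignment_gain(factual_curve), convergence_layer(factual_curve))
hallucinated_summary = (
    alignment_gain(hallucinated_curve),
    convergence_layer(hallucinated_curve),
)
```

With these toy curves the factual sample shows a positive gain and a late peak, while the hallucinated sample shows a negative gain and an early peak, mirroring the contrast the caption describes.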

Theorems & Definitions (4)

  • Definition 1: Language Model
  • Definition 2: Ground-truth Encoder
  • Definition 3: Hallucination Detection Function
  • Theorem 1: Trajectory Separability