
Detecting LLM Hallucinations via Embedding Cluster Geometry: A Three-Type Taxonomy with Measurable Signatures

Matic Korun

TL;DR

These findings establish the geometric prerequisites for type-specific hallucination detection and yield testable predictions about architecture-dependent vulnerability profiles.

Abstract

We propose a geometric taxonomy of large language model hallucinations based on observable signatures in token embedding cluster structure. By analyzing the static embedding spaces of 11 transformer models spanning encoder (BERT, RoBERTa, ELECTRA, DeBERTa, ALBERT, MiniLM, DistilBERT) and decoder (GPT-2) architectures, we identify three operationally distinct hallucination types: Type 1 (center-drift) under weak context, Type 2 (wrong-well convergence) to locally coherent but contextually incorrect cluster regions, and Type 3 (coverage gaps) where no cluster structure exists. We introduce three measurable geometric statistics: α (polarity coupling), β (cluster cohesion), and λ_s (radial information gradient). Across all 11 models, polarity structure (α > 0.5) is universal (11/11), cluster cohesion (β > 0) is universal (11/11), and the radial information gradient is significant (9/11, p < 0.05). We demonstrate that the two models failing λ_s significance -- ALBERT and MiniLM -- do so for architecturally explicable reasons: factorized embedding compression and distillation-induced isotropy, respectively. These findings establish the geometric prerequisites for type-specific hallucination detection and yield testable predictions about architecture-dependent vulnerability profiles.

Paper Structure (44 sections, 4 equations, 5 figures, 2 tables)


Figures (5)

  • Figure 1: Geometric signatures of the three hallucination types in BERT-base embedding space ($k=40$ clusters). Left: Type 1 (center-drift) zone, with low norm and low cluster membership. Center: Type 2 (wrong-well) zone, with high max centroid similarity and high membership. Right: Type 3 (coverage gap) zone, with low max centroid similarity and variable self-information. Colored points indicate tokens falling in each zone; gray points show the full distribution.
  • Figure 2: Radial information gradient for BERT-base. Left: Mean self-information vs. embedding norm with linear and quadratic fits. The quadratic model ($R^2 = 0.992$) significantly outperforms the linear ($R^2 = 0.934$; $F = 231.7$, $p < 0.001$). Right: Residuals show the quadratic fit eliminates the systematic curvature present in linear residuals.
  • Figure 3: Architectural anomaly analysis. Top row: Radial entropy profiles with degree 1--3 polynomial fits. ALBERT (128D) shows high variance from dimensional compression; MiniLM (384D) shows a near-linear inverted profile; GPT-2 (768D) shows an order-of-magnitude weaker gradient than BERT. BERT-base (768D) provides the clear quadratic baseline. Bottom row: Cumulative PCA variance. ALBERT needs nearly all dimensions to capture 95% variance (space is full). MiniLM distributes variance almost uniformly (extreme isotropy). GPT-2 and BERT show similar utilization profiles despite qualitatively different radial structure.
  • Figure 4: Distributions of $\alpha$ (polarity coupling) and $\beta$ (cluster cohesion, both methods) for BERT-base with $k=40$ clusters.
  • Figure 5: $\lambda_{\mathrm{r}}$ across all 11 models. Top: Magnitude and sign of the quadratic coefficient (green = significant at $p < 0.05$, red = non-significant). Bottom: Linear vs. quadratic $R^2$; the quadratic model improves fit in nearly all cases.
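
The captions for Figures 2 and 5 compare a linear and a quadratic fit of mean self-information against embedding norm, reporting $R^2$ for each and an $F$ statistic for the improvement. The sketch below shows one common way to make such a comparison, a nested-model F-test, and is not taken from the paper's code. The function names (`polyfit_rss`, `compare_linear_quadratic`) and the synthetic data are assumptions made purely for illustration; the paper's binning, weighting, and significance procedure may differ.

```python
import math


def polyfit_rss(x, y, degree):
    """Least-squares polynomial fit via the normal equations.

    Returns (coeffs, rss), where coeffs[k] multiplies x**k.
    Adequate for low degrees and well-scaled x; a sketch, not a robust solver.
    """
    n = degree + 1
    # Normal equations A c = b with A[i][j] = sum x^(i+j), b[i] = sum y * x^i.
    a = [[sum(xi ** (i + j) for xi in x) for j in range(n)] for i in range(n)]
    b = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    # Residual sum of squares of the fitted polynomial.
    rss = sum(
        (yi - sum(c * xi ** k for k, c in enumerate(coeffs))) ** 2
        for xi, yi in zip(x, y)
    )
    return coeffs, rss


def compare_linear_quadratic(x, y):
    """Compare nested degree-1 and degree-2 fits on the same data."""
    m = len(x)
    mean_y = sum(y) / m
    tss = sum((yi - mean_y) ** 2 for yi in y)
    _, rss_lin = polyfit_rss(x, y, 1)
    coeffs_quad, rss_quad = polyfit_rss(x, y, 2)
    # The quadratic model adds one parameter and leaves m - 3 residual
    # degrees of freedom, so F = (RSS_lin - RSS_quad) / (RSS_quad / (m - 3)).
    f_stat = (rss_lin - rss_quad) / (rss_quad / (m - 3))
    return {
        "r2_linear": 1.0 - rss_lin / tss,
        "r2_quadratic": 1.0 - rss_quad / tss,
        "quadratic_coefficient": coeffs_quad[2],
        "f_stat": f_stat,
    }


# Synthetic stand-in data (NOT the paper's measurements): a curved trend
# in "norm" x with a small deterministic perturbation.
x = [0.5 + 0.1 * i for i in range(20)]
y = [2.0 + 0.5 * xi + 0.8 * xi ** 2 + 0.02 * math.sin(7 * i)
     for i, xi in enumerate(x)]

result = compare_linear_quadratic(x, y)
print(result)
```

A p-value for the resulting statistic would come from the upper tail of an F distribution with 1 and m - 3 degrees of freedom, for example via `scipy.stats.f.sf(f_stat, 1, m - 3)`; it is left out here to keep the sketch free of third-party dependencies.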