A Geometric Taxonomy of Hallucinations in LLMs

Javier Marín

TL;DR

The contribution is a geometric taxonomy clarifying the scope of embedding-based detection: Types I and II are detectable; Type III requires external verification mechanisms.

Abstract

The term "hallucination" in large language models conflates distinct phenomena with different geometric signatures in embedding space. We propose a taxonomy identifying three types: unfaithfulness (failure to engage with provided context), confabulation (invention of semantically foreign content), and factual error (incorrect claims within correct conceptual frames). We observe a striking asymmetry. On standard benchmarks where hallucinations are LLM-generated, detection is domain-local: AUROC 0.76-0.99 within domains, but 0.50 (chance level) across domains. Discriminative directions are approximately orthogonal between domains (mean cosine similarity -0.07). On human-crafted confabulations - invented institutions, redefined terminology, fabricated mechanisms - a single global direction achieves 0.96 AUROC with 3.8% cross-domain degradation. We interpret this divergence as follows: benchmarks capture generation artifacts (stylistic signatures of prompted fabrication), while human-crafted confabulations capture genuine topical drift. The geometric structure differs because the underlying phenomena differ. Type III errors show 0.478 AUROC - indistinguishable from chance. This reflects a theoretical constraint: embeddings encode distributional co-occurrence, not correspondence to external reality. Statements with identical contextual patterns occupy similar embedding regions regardless of truth value. The contribution is a geometric taxonomy clarifying the scope of embedding-based detection: Types I and II are detectable; Type III requires external verification mechanisms.
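The domain-local versus global contrast described in the abstract can be illustrated with a small synthetic experiment. The sketch below is not the paper's pipeline: it assumes a mean-difference vector as the "discriminative direction", uses Gaussian noise in place of real embeddings, and plants the hallucination signal along a different axis in each of two hypothetical domains. Under those assumptions, a direction fitted in one domain separates that domain well, falls to roughly chance on the other, and is nearly orthogonal to the other domain's direction.

```python
import numpy as np

def mean_difference_direction(X, y):
    """Unit vector from the mean of clean (y=0) to hallucinated (y=1) embeddings."""
    d = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    return d / np.linalg.norm(d)

def auroc(scores, y):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos, neg = scores[y == 1], scores[y == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def make_domain(rng, direction, n=200, dim=64, shift=3.0):
    """Synthetic stand-in for one domain: positives are shifted along `direction`."""
    y = np.repeat([0, 1], n)
    X = rng.normal(size=(2 * n, dim))
    X[y == 1] += shift * direction
    return X, y

rng = np.random.default_rng(0)
dim = 64
axis_a = np.eye(dim)[0]  # assumed signal axis for hypothetical domain A
axis_b = np.eye(dim)[1]  # assumed signal axis for hypothetical domain B

Xa, ya = make_domain(rng, axis_a, dim=dim)
Xb, yb = make_domain(rng, axis_b, dim=dim)

w_a = mean_difference_direction(Xa, ya)
w_b = mean_difference_direction(Xb, yb)

within_auroc = auroc(Xa @ w_a, ya)    # direction from A, scored on A (in-sample)
cross_auroc = auroc(Xb @ w_a, yb)     # direction from A, scored on B
direction_cosine = float(w_a @ w_b)   # similarity of the two domain directions

print(f"within-domain AUROC: {within_auroc:.2f}")
print(f"cross-domain AUROC:  {cross_auroc:.2f}")
print(f"cosine(w_a, w_b):    {direction_cosine:.2f}")
```

The within-domain score here is computed on the same data used to fit the direction, so it is optimistic; a held-out split would be needed for a fair estimate. If both domains shared one signal axis instead, the same code would show little cross-domain degradation, which is the pattern the abstract reports for human-crafted confabulations.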

Paper Structure

This paper contains 32 sections, 13 equations, 1 figure, 7 tables.

Figures (1)

  • Figure 1: Geometric taxonomy of hallucination types on the embedding hypersphere $\mathbf{S}^{d-1}$. Query $\mathbf{q}$ and context $\mathbf{c}$ define anchor points. The plausibility manifold $\mathcal{P}_q$ (green, dashed) contains semantically appropriate responses. A grounded response (blue) departs from $\mathbf{q}$ toward $\mathbf{c}$ and lands within $\mathcal{P}_q$. Type I unfaithfulness (purple) shows semantic laziness, remaining near $\mathbf{q}$. Type II confabulation (red) departs in an unrelated direction, landing outside $\mathcal{P}_q$. Type III factual error (orange) reaches $\mathcal{P}_q$ but occupies a factually incorrect position within the plausible region.
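The geometry in the caption can be made concrete with angular distances on the unit sphere. The following is a toy three-dimensional sketch with hand-picked vectors, not the paper's Semantic Grounding Index or Directional Grounding Index (their formulas are not reproduced on this page); the names `grounded`, `lazy`, and `foreign` are illustrative assumptions standing in for the blue, purple, and red responses in Figure 1.

```python
import numpy as np

def unit(v):
    """Project a vector onto the unit sphere."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def angle(u, v):
    """Geodesic (angular) distance in radians between two directions."""
    return float(np.arccos(np.clip(unit(u) @ unit(v), -1.0, 1.0)))

# Hand-picked anchors and responses (illustrative only).
q = unit([1.0, 0.0, 0.0])          # query
c = unit([0.0, 1.0, 0.0])          # context
grounded = unit([0.3, 1.0, 0.0])   # departs from q toward c
lazy = unit([1.0, 0.1, 0.0])       # Type I pattern: remains near q
foreign = unit([0.0, 0.0, 1.0])    # Type II pattern: unrelated direction

for name, r in [("grounded", grounded), ("lazy", lazy), ("foreign", foreign)]:
    print(f"{name:9s} angle to q = {angle(r, q):.2f} rad, "
          f"angle to c = {angle(r, c):.2f} rad")
```

In this toy setup the `lazy` response stays closer to `q` than the `grounded` one does, and the `foreign` response is far from both anchors. A Type III response has no counterpart here: by the caption's description it lands inside the plausible region, so distances of this kind would not separate it from a grounded response.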

Theorems & Definitions (4)

  • Definition 1: Semantic Grounding Index
  • Definition 2: Directional Grounding Index - $\Gamma$
  • Definition 3: Local Directional Grounding Index
  • Definition 4: Plausibility Manifold