
From Prerequisites to Predictions: Validating a Geometric Hallucination Taxonomy Through Controlled Induction

Matic Korun

TL;DR

The results establish coverage-gap hallucinations as the most geometrically distinctive failure mode, carried by magnitude rather than direction, and confirm the Type 1/2 non-separation as genuine at 124M parameters.

Abstract

We test whether a geometric hallucination taxonomy -- classifying failures as center-drift (Type 1), wrong-well convergence (Type 2), or coverage gaps (Type 3) -- can distinguish hallucination types through controlled induction in GPT-2. Using a two-level statistical design with prompts ($N = 15$/group) as the unit of inference, we run each experiment 20 times with different generation seeds to quantify result stability. In static embeddings, Type 3 norm separation is robust (significant in 18/20 runs, Holm-corrected in 14/20, median $r = +0.61$). In contextual hidden states, the Type 3 norm effect direction is stable (19/20 runs) but underpowered at $N = 15$ (significant in 4/20, median $r = -0.28$). Types 1 and 2 do not separate in either space (${\leq}\,3/20$ runs). Token-level tests inflate significance by 4--16$\times$ through pseudoreplication -- a finding replicated across all 20 runs. The results establish coverage-gap hallucinations as the most geometrically distinctive failure mode, carried by magnitude rather than direction, and confirm the Type 1/2 non-separation as genuine at 124M parameters.
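As an illustration of the two-level design described in the abstract, the sketch below collapses token-level measurements to one value per prompt and then computes a rank-biserial $r$ between two groups of prompts. It is a minimal sketch, not the paper's code: the function names, the use of the per-prompt mean as the aggregate, the sign convention of $r$, and the toy numbers are all assumptions made for illustration.

```python
from statistics import mean


def prompt_level(values_by_prompt):
    """Collapse token-level values to one number per prompt.

    Assumption: the per-prompt mean is used as the aggregate; the source
    only states that prompts are the unit of inference.
    """
    return [mean(tokens) for tokens in values_by_prompt]


def rank_biserial(x, y):
    """Rank-biserial r = (wins - losses) / (n_x * n_y).

    A 'win' is a pair (a, b) with a from x, b from y and a > b; ties count
    for neither side. Positive r means x tends to exceed y (assumed sign
    convention).
    """
    wins = sum(a > b for a in x for b in y)
    losses = sum(a < b for a in x for b in y)
    return (wins - losses) / (len(x) * len(y))


# Hypothetical toy data: each inner list holds token-level norms for one prompt.
group_a = [[3.0, 3.2, 2.8], [4.1, 3.9], [5.0, 5.2, 4.8, 5.0]]
group_b = [[1.0, 1.2], [2.1, 1.9, 2.0], [3.0, 3.0]]

a = prompt_level(group_a)  # one value per prompt, N = 3 (not N = 9 tokens)
b = prompt_level(group_b)
r = rank_biserial(a, b)
```

Testing on `a` and `b` rather than on the pooled tokens keeps the sample size equal to the number of prompts, which is what avoids treating correlated tokens from the same generation as independent observations.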

Paper Structure (44 sections, 4 figures, 7 tables)

Figures (4)

  • Figure 1: Effect-size stability across 20 independent runs. Each row shows one metric $\times$ pairwise comparison. Thick bars: IQR of rank-biserial $r$; thin whiskers: full range; dots: median. Green shading: Holm-corrected significance in ${\geq}50\%$ of runs. Yellow shading: ${\geq}15\%$. Right annotations: nominal significance rate. Static norm T1--T3 and T2--T3 are the only comparisons with stable Holm survival. Contextual effects are consistently directional but underpowered at $N = 15$.
  • Figure 2: Pseudoreplication inflation in the contextual experiment across 20 runs. Each row shows one metric $\times$ pairwise comparison. Squares: prompt-level significance rate ($N = 15$). Circles: token-level significance rate ($N \approx 900$). Connectors show the inflation gap, annotated as a percentage. The largest inflation (max_sim T2--T3) reaches 16$\times$: significant at token level in 80% of runs but at prompt level in only 5%. This replicated pattern provides an empirical benchmark: researchers observing token-level significance in autoregressive generation should expect 4--16$\times$ inflation relative to properly aggregated prompt-level analysis.
  • Figure 3: Contextual hidden-state signatures in norm--$H(\mathbf{v})$ space from a representative run (seed closest to median $p$-values). Colored points: generated tokens per condition. Dashed lines: percentile-calibrated zone boundaries. The narrow dynamic range of contextual $H(\mathbf{v})$ (${\approx}\,0.93$--$0.999$) is evident. Type 3 shows a slight leftward (lower norm) shift that is visible in the aggregate but too small to distinguish individual tokens.
  • Figure 4: Distribution of prompt-level Mann-Whitney $p$-values across 20 runs for the norm metric. Each dot is one run; horizontal bars show medians. Dashed line: $\alpha = 0.05$. Left: static embeddings show T1--T3 and T2--T3 medians well below $\alpha$. Right: contextual hidden states show medians above $\alpha$ despite the same directional effect, confirming the underpowered regime.
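The captions above refer to Holm-corrected significance across pairwise comparisons. The sketch below shows the standard Holm step-down procedure in plain Python. The function name and the example $p$-values are hypothetical, and the number of comparisons corrected together in the paper is not stated here, so the family size of three is an assumption.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure.

    Sort p-values ascending and compare the k-th smallest (k = 0, 1, ...)
    against alpha / (m - k). Stop at the first failure; every hypothesis
    from that point on is retained.
    Returns a list of booleans aligned with the input order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for k, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - k):
            reject[idx] = True
        else:
            break
    return reject


# Hypothetical p-values for three pairwise comparisons (T1-T2, T1-T3, T2-T3).
decisions = holm_reject([0.40, 0.004, 0.012])
```

A comparison that is nominally significant at $\alpha = 0.05$ can fail this correction, which is why the nominal and Holm-corrected significance rates reported across the 20 runs differ.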