Whitening Reveals Cluster Commitment as the Geometric Separator of Hallucination Types

Matic Korun

TL;DR

Three findings: whitening as a preprocessing step that reveals cluster commitment as the theoretically correct separating metric; evidence that the Type~1/2 boundary is a capacity limitation rather than a measurement artifact; and a methodological result showing prompt-set fragility in near-saturated representation spaces.

Abstract

A geometric hallucination taxonomy distinguishes three failure types -- center-drift (Type~1), wrong-well convergence (Type~2), and coverage gaps (Type~3) -- by their signatures in embedding cluster space. Prior work found Types~1 and~2 indistinguishable in full-dimensional contextual measurement. We address this through PCA-whitening and eigenspectrum decomposition on GPT-2-small, using multi-run stability analysis (20 seeds) with prompt-level aggregation. Whitening transforms the micro-signal regime into a space where peak cluster alignment (max\_sim) separates Type~2 from Type~3 at Holm-corrected significance, with condition means following the taxonomy's predicted ordering: Type~2 (highest commitment) $>$ Type~1 (intermediate) $>$ Type~3 (lowest). A first directionally stable but underpowered hint of Type~1/2 separation emerges via the same metric, generating a capacity prediction for larger models. Prompt diversification from 15 to 30 prompts per group eliminates a false positive in whitened entropy that appeared robust at the smaller set, demonstrating prompt-set sensitivity in the micro-signal regime. Eigenspectrum decomposition localizes this artifact to the dominant principal components and confirms that Type~1/2 separation does not emerge in any spectral band, rejecting the spectral mixing hypothesis. The contribution is threefold: whitening as preprocessing that reveals cluster commitment as the theoretically correct separating metric, evidence that the Type~1/2 boundary is a capacity limitation rather than a measurement artifact, and a methodological finding about prompt-set fragility in near-saturated representation spaces.
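
As a rough illustration of the preprocessing described in the abstract, the sketch below fits a PCA-whitening transform on a set of reference embeddings and then scores a vector by its peak cluster alignment. It assumes that max\_sim is the maximum cosine similarity between a whitened vector and a set of cluster centroids; that definition, the function names, and the `eps` regularizer are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def fit_pca_whitener(X, eps=1e-8):
    """Fit a PCA-whitening transform on embeddings X of shape (n_samples, dim).

    Returns the sample mean and a matrix W such that (X - mean) @ W has
    approximately identity covariance. `eps` is a hypothetical regularizer
    guarding against division by near-zero eigenvalues.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]             # reorder so PC 1 has largest variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs / np.sqrt(eigvals + eps)          # rescale each principal axis to unit variance
    return mean, W

def whiten(X, mean, W):
    """Project embeddings into the whitened principal-component basis."""
    return (X - mean) @ W

def max_sim(v, centroids):
    """Peak cosine similarity between vector v and the rows of `centroids`.

    Assumed reading of 'peak cluster alignment'; the paper's exact
    definition may differ.
    """
    v = v / np.linalg.norm(v)
    C = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return float(np.max(C @ v))
```

In this sketch the columns of `W` are ordered by decreasing eigenvalue, so a spectral band such as PCs 1--16 would correspond to the column slice `W[:, :16]`. Centroids passed to `max_sim` would need to be computed in the same whitened space as the vector being scored.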

Paper Structure (43 sections, 1 equation, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Whitened max_sim results. (a) Per-prompt mean max_sim by condition from the representative seed ($N = 30$/group). The ordering T2 $>$ T1 $>$ T3 matches the taxonomy's core prediction: wrong-well commits to a cluster (highest), center-drift drifts without commitment (intermediate), coverage gap aligns with nothing (lowest). Grand means annotated. (b) Rank-biserial effect size $r$ across 20 independent generation seeds, with median and IQR bars. T2--T3 (green): 20/20 directional stability, 8/20 Holm. T1--T2 (blue): 17/20 directional, 3/20 Holm---the emerging signal. T1--T3 (gray): directionally stable but weak, reflecting the combined gap.
  • Figure 2: The $H(\mathbf{v})$ collapse and max_sim emergence. Each dot is one seed's prompt-level Mann-Whitney $p$-value (20 seeds total). Vertical ticks: median. Red dashed line: $\alpha = 0.05$. (a) Whitened $H(\mathbf{v})$ at $N = 30$: $p$-values scattered widely for all pairs. The strongest $N = 15$ signals---T1--T3 (65% Holm) and T2--T3 (30% Holm)---collapse to 5% and 0% significance respectively. (b) Whitened max_sim at $N = 30$: T2--T3 (green) clusters near zero (12/20 sig, 8/20 Holm). T1--T2 (blue) shows a partial cluster below $\alpha$ (7/20 sig, 3/20 Holm)---the emerging signal.
  • Figure 3: Spectral band decomposition ($N = 15$/group, 20 seeds). Cell color: fraction of 20 runs where prompt-level Mann-Whitney $p < 0.05$. Cell annotations (where sig rate $\geq 15$%): median rank-biserial $r$ (top) and Holm survival rate (bottom). T1--T2 columns are uniformly pale across all bands---Type 1/2 separation does not exist in any spectral region. Dominant band $H(\mathbf{v})$ (PCs 1--16) shows the artifact that collapses at $N = 30$. Tail band effects (PCs 513--768) are genuine but in $<$0.1% variance dimensions.
  • Figure 4: Token-level vs. prompt-level significance ($-\log_{10} p$, median across 20 seeds) for all experiment--metric--pair combinations. Dashed lines: $p = 0.05$. Circles: whitened ($N = 30$). Squares: raw norm ($N = 30$). Diamonds: spectral bands ($N = 15$). Color: pair type (blue = T1--T2, gray = T1--T3, green = T2--T3). Points in the "token only" quadrant (lower right) are pseudoreplication artifacts. The spectral tail produces the most extreme both-levels results.
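
The statistics named in these captions can be sketched in a few lines. The example below is a minimal, standard-library illustration of three steps: averaging token-level values within each prompt (so that the prompt, not the token, is the unit of analysis), computing a rank-biserial effect size between two conditions, and applying the Holm step-down correction to a family of pairwise $p$-values. The function names, the sign convention for $r$, and the assumption that the Holm family consists of the three pairwise comparisons are illustrative choices, not details confirmed by the paper.

```python
from collections import defaultdict
from statistics import mean

def aggregate_by_prompt(token_values, prompt_ids):
    """Average token-level values within each prompt, giving one value per prompt.

    Testing on these per-prompt means avoids treating correlated tokens
    from the same prompt as independent samples (pseudoreplication).
    """
    groups = defaultdict(list)
    for value, pid in zip(token_values, prompt_ids):
        groups[pid].append(value)
    return [mean(vals) for vals in groups.values()]  # first-seen prompt order

def rank_biserial(x, y):
    """Rank-biserial r = (pairs with x > y minus pairs with x < y) / (n_x * n_y).

    Ties contribute zero. r = 1 means every x exceeds every y.
    """
    wins = sum(1 for a in x for b in y if a > b)
    losses = sum(1 for a in x for b in y if a < b)
    return (wins - losses) / (len(x) * len(y))

def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure; returns reject decisions in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared to alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values also fail
    return reject
```

Under these assumptions, a seed would count toward a caption's "Holm" tally when its pairwise $p$-value survives `holm_reject` applied across the pairs tested for that seed. The $p$-values themselves would come from a Mann-Whitney test on the per-prompt means, which is not implemented here.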