Uncovering Uncertainty in Transformer Inference
Greyson Brothers, Willa Mannering, Amber Tien, John Winder
TL;DR
This paper investigates the Iterative Inference Hypothesis for transformer language models, asking how latent representations in the residual stream are progressively refined during autoregressive generation and whether correct and incorrect outputs diverge along this trajectory. The authors propose Residual Cross-Entropy as a lightweight, per-layer diagnostic to quantify convergence of residual predictions toward the next-token embedding, and validate this approach on GPT-2 XL with an idiom-completion dataset. Key findings include observable per-layer loss decay in the $n^{th}$ token embedding, a strong association between lower cross-entropy to the chosen target and correct generations (AUC $=0.9239$), and evidence that output cross-entropy tracks model uncertainty in open-ended prompts. The work suggests a practical uncertainty signal for mitigating hallucinations with minimal computation and outlines future work on broader datasets, multi-token generation, and additional convergence metrics.
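As a rough illustration of the per-layer diagnostic described above, the sketch below decodes each layer's residual vector through an unembedding matrix and measures the cross-entropy against the token that is finally emitted. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy identity unembedding, and the synthetic residual trajectory are all hypothetical, and whether a final layer normalization is applied before decoding is not specified in this summary.

```python
import numpy as np


def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def residual_cross_entropy(residuals, unembed, target_id):
    """Cross-entropy of each layer's decoded residual against one target token.

    residuals: (num_layers, d_model) residual vectors at the final position
    unembed:   (d_model, vocab_size) unembedding matrix (assumed shared by all layers)
    target_id: index of the token the model ultimately emits
    """
    logits = residuals @ unembed  # (num_layers, vocab_size)
    return -log_softmax(logits)[:, target_id]


# Toy setup (assumption): identity unembedding, so token i is decoded from axis i.
num_layers, d_model, vocab_size = 8, 6, 6
unembed = np.eye(d_model, vocab_size)
target_id = 2

# Synthetic trajectory (assumption): the residual starts at zero, which decodes
# to a uniform distribution, and moves linearly toward the target direction.
alphas = np.linspace(0.0, 1.0, num_layers)
residuals = np.stack([a * 5.0 * unembed[:, target_id] for a in alphas])

losses = residual_cross_entropy(residuals, unembed, target_id)
for layer, loss in enumerate(losses):
    print(f"layer {layer}: residual cross-entropy = {loss:.4f}")
```

In this toy trajectory the loss starts at $\log 6$ (a uniform prediction over six tokens) and decreases at every layer. With a real model, `residuals` would instead come from the hidden states at each layer and `unembed` from the model's output projection.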
Abstract
We explore the Iterative Inference Hypothesis (IIH) within the context of transformer-based language models, aiming to understand how a model's latent representations are progressively refined and whether observable differences are present between correct and incorrect generations. Our findings provide empirical support for the IIH, showing that the $n^{th}$ token embedding in the residual stream follows a trajectory of decreasing loss. Additionally, we observe that the rate at which residual embeddings converge to a stable output representation reflects uncertainty in the token generation process. Finally, we introduce a method utilizing cross-entropy to detect this uncertainty and demonstrate its potential to distinguish between correct and incorrect token generations on a dataset of idioms.
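One common way to quantify how well a scalar score separates correct from incorrect generations is the area under the ROC curve (AUC). The sketch below computes it directly as a pairwise ranking statistic, treating a lower cross-entropy as evidence of a correct generation. The function name and the toy scores are hypothetical and chosen only for illustration; they are not data from the paper, and the authors' exact evaluation procedure may differ.

```python
import numpy as np


def auc_lower_is_correct(scores, is_correct):
    """AUC for a score where lower values should indicate correct generations.

    Equals the probability that a randomly chosen correct example has a lower
    score than a randomly chosen incorrect one, with ties counted as one half.
    """
    scores = np.asarray(scores, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)
    correct = scores[is_correct]
    incorrect = scores[~is_correct]
    # diff[i, j] > 0 when incorrect example i scores higher than correct example j.
    diff = incorrect[:, None] - correct[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / diff.size)


# Made-up cross-entropy values (assumption), three correct and three incorrect.
toy_ce = [0.10, 0.40, 0.30, 2.00, 1.50, 0.35]
toy_correct = [True, True, True, False, False, False]

auc = auc_lower_is_correct(toy_ce, toy_correct)
print(f"AUC = {auc:.4f}")  # 8 of the 9 correct/incorrect pairs are ordered as expected
```

An AUC of 1.0 would mean every correct generation has a lower cross-entropy than every incorrect one, while 0.5 corresponds to chance-level separation.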
