Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context

Faris Chaudhry, Siddhant Gadkari

TL;DR

We suggest that ICL emerges from the construction of task-adaptive statistical estimators rather than simple similarity matching, and that the model appears to adapt the point at which decisions become linearly decodable.

Abstract

In-context learning (ICL) allows Transformers to adapt to novel tasks without weight updates, yet the underlying algorithms remain poorly understood. We adopt a statistical decision-theoretic perspective by investigating simple binary hypothesis testing, where the optimal policy is determined by the likelihood-ratio test. Notably, this setup provides a mathematically rigorous setting for mechanistic interpretability where the target algorithmic ground truth is known. By training Transformers on tasks requiring distinct geometries (linear shifted means vs. nonlinear variance estimation), we demonstrate that the models approximate the Bayes-optimal sufficient statistics from context up to some monotonic transformation, matching the performance of an ideal oracle estimator in nonlinear regimes. Leveraging this analytical ground truth, mechanistic analysis via logit lens and circuit alignment suggests that the model does not rely on a fixed kernel smoothing heuristic. Instead, it appears to adapt the point at which decisions become linearly decodable: exhibiting patterns consistent with a voting-style ensemble for linear tasks while utilizing a deeper sequential computation for nonlinear tasks. These findings suggest that ICL emerges from the construction of task-adaptive statistical estimators rather than simple similarity matching.
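To make the likelihood-ratio ground truth concrete, the following is a minimal sketch of the log-likelihood ratio (LLR) for two Gaussian hypothesis tests of the kind the abstract describes. The parameterizations are assumptions chosen for illustration and are not taken from the paper: a mean-shift task with $H_1: \mathcal{N}(k+\mu, I)$ vs. $H_0: \mathcal{N}(k-\mu, I)$, and a variance task with $H_1: \mathcal{N}(0, \sigma_1^2 I)$ vs. $H_0: \mathcal{N}(0, \sigma_0^2 I)$. The function names and the zero threshold are likewise hypothetical.

```python
import numpy as np

def llr_mean_shift(x, mu, k):
    """LLR for H1: N(k + mu, I) vs H0: N(k - mu, I) (assumed parameterization).

    Expanding the two quadratic forms leaves an affine statistic in x.
    """
    return 2.0 * (x - k) @ mu

def llr_variance(x, sigma0, sigma1):
    """LLR for H1: N(0, sigma1^2 I) vs H0: N(0, sigma0^2 I) (assumed parameterization).

    The statistic depends on x only through the squared norm ||x||^2.
    """
    d = x.shape[-1]
    sq_norm = np.sum(x**2, axis=-1)
    return 0.5 * sq_norm * (1.0 / sigma0**2 - 1.0 / sigma1**2) - d * np.log(sigma1 / sigma0)

def gaussian_logpdf(x, mean, sigma):
    """Log-density of an isotropic Gaussian N(mean, sigma^2 I); used only as a check."""
    d = x.shape[-1]
    sq = np.sum((x - mean) ** 2, axis=-1)
    return -0.5 * sq / sigma**2 - d * np.log(sigma) - 0.5 * d * np.log(2.0 * np.pi)

rng = np.random.default_rng(0)
d = 4
mu = rng.normal(size=d)      # hypothetical shift direction
k = rng.normal(size=d)       # hypothetical offset
x = rng.normal(size=(5, d))  # five query points

llr_a = llr_mean_shift(x, mu, k)
llr_b = llr_variance(x, sigma0=1.0, sigma1=2.0)

# Likelihood-ratio test with threshold 0 (equal priors, 0-1 loss): choose H1 when LLR > 0.
decisions_a = llr_a > 0.0
decisions_b = llr_b > 0.0
```

Under these assumptions the first statistic is affine in $x$ and the second is a function of $\|x\|^2$ alone, which is the sense in which the two tasks call for linear and nonlinear decision geometries respectively.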

Paper Structure (36 sections, 10 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Approximation of the LLR. Regression of the Transformer's output logits against the true analytical LLR for validation episodes. (Left) Task A: The model exhibits a strong linear correlation ($r=0.859$), indicating it approximates the affine sufficient statistic $\mu^\top(x-k)$. (Right) Task B: The model achieves near-perfect rank correlation ($\rho=0.976$), effectively recovering the quadratic sufficient statistic $\|x\|^2$ up to a monotone transform. The sigmoidal shape suggests the model has learned a calibrated probability mapping, saturating for high-confidence inputs while preserving the optimal decision ordering.
  • Figure 2: Mechanistic Adaptivity. (Left) Logit Lens (Task A): The correlation with the true LLR rises significantly in Layer 1, suggesting early linear decoding or aggregation. (Right) OV Circuit Alignment: In Task A (top), Layer 0 heads (e.g., Head 2) show strong positive alignment ($>0.7$) with the logit direction, acting as a voting ensemble. In Task B (bottom), Layer 0 heads are effectively silent ($<0.26$), implying that the model suppresses early voting to perform deeper sequential processing in Layer 1. Both OV circuits are taken from representative seeds; qualitatively similar behavior persisted across seeds.
  • Figure 3: OOD Generalization Degradation (Task A). (Left) Learning curves show a significant generalization gap: while the model masters the training distribution (blue), it struggles to extrapolate to large shifts (orange), achieving only partial generalization. (Right) The correlation with the true LLR degrades to $r=0.567$; the learned decision rule is a local approximation rather than the exact symbolic LLR.
  • Figure 4: Transformer vs. Kernel Regression. The low correlation indicates the model implements a more complex decision rule than similarity-based label smoothing.
  • Figure 5: Logit Lens for Task B. The Pearson and Spearman correlations with the true LLR are effectively zero for the initial layers, spiking only at the final output. This confirms that the model does not perform a greedy linear approximation early in the network, but relies on the full depth of the Transformer to construct some nonlinear decision boundary.
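The captions above report both Pearson and Spearman correlations and describe recovery of the LLR "up to a monotone transform". As a minimal sketch of why rank correlation is the relevant measure in that case, the example below passes a synthetic stand-in for the LLR through a saturating monotone map. The `tanh` readout and the synthetic data are assumptions for illustration, not the model's actual output or the paper's data.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

def spearman(a, b):
    """Spearman rho computed as the Pearson correlation of ranks.

    This simple version assumes there are no ties, which holds for the
    continuous synthetic data used here.
    """
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return pearson(rank_a, rank_b)

rng = np.random.default_rng(0)
true_llr = rng.normal(scale=2.0, size=2000)  # synthetic stand-in for the analytical LLR
saturating_logit = np.tanh(true_llr)         # hypothetical monotone, saturating readout

r = pearson(true_llr, saturating_logit)
rho = spearman(true_llr, saturating_logit)

# A strictly increasing map that fixes 0 leaves the thresholded decisions unchanged.
same_decisions = np.array_equal(true_llr > 0, saturating_logit > 0)

print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}, same decisions: {same_decisions}")
```

In this toy setting the saturation lowers the Pearson correlation while the ordering of inputs, and hence any threshold rule on them, is preserved. This is consistent with reading a high rank correlation as evidence that the decision ordering is retained even when the relationship is not linear.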