Towards Better Understanding of In-Context Learning Ability from In-Context Uncertainty Quantification

Shang Liu, Zhongze Cai, Guanting Chen, Xiaocheng Li

TL;DR

This work advances the theoretical and empirical understanding of in-context learning by framing it as uncertainty quantification in a bi-objective Transformer setup that predicts both mean and uncertainty. It derives a sharp generalization bound that depends on the context window $S$ and sequence length $T$, showing near Bayes-optimal in-distribution risk and highlighting the role of training distribution information. Through extensive experiments, the authors reveal how ICL responds to task, covariate, and length shifts, propose meta-training to bolster covariate robustness, and show that removing positional encoding can improve long-context generalization. The study clarifies the limits of Bayesian interpretation under OOD and points to practical strategies for designing pretraining and prompt designs that enhance robust ICL.

Abstract

Predicting simple function classes has been widely used as a testbed for developing theory and understanding of the trained Transformer's in-context learning (ICL) ability. In this paper, we revisit the training of Transformers on linear regression tasks and, unlike the existing literature, consider a bi-objective prediction task of predicting both the conditional expectation $\mathbb{E}[Y|X]$ and the conditional variance $\mathrm{Var}(Y|X)$. This additional uncertainty quantification objective provides a handle to (i) better design out-of-distribution experiments that distinguish ICL from in-weight learning (IWL) and (ii) better separate algorithms that use the prior information of the training distribution from those that do not. Theoretically, we show that the trained Transformer reaches near Bayes-optimum, suggesting that it uses information from the training distribution. Our method can be extended to other cases. Specifically, with the Transformer's context window $S$, we prove a generalization bound of $\tilde{\mathcal{O}}(\sqrt{\min\{S, T\}/(n T)})$ on $n$ tasks with sequences of length $T$, a sharper analysis than previous results of $\tilde{\mathcal{O}}(\sqrt{1/n})$. Empirically, we illustrate that while the trained Transformer behaves as the Bayes-optimal solution as a natural consequence of supervised training in distribution, it does not necessarily perform Bayesian inference when facing task shifts, in contrast to the equivalence between the two proposed in much of the existing literature. We also demonstrate the trained Transformer's ICL ability under covariate shift and prompt-length shift and interpret it as generalization over a meta distribution.
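
To see where the stated bound improves on the earlier $\tilde{\mathcal{O}}(\sqrt{1/n})$ rate, it helps to factor out $\sqrt{1/n}$. The short derivation below compares only the leading terms and ignores the constants and logarithmic factors hidden in $\tilde{\mathcal{O}}$; the values $S = 10$, $T = 40$ are illustrative and not taken from the paper.

```latex
\sqrt{\frac{\min\{S,T\}}{nT}}
  \;=\; \sqrt{\frac{\min\{S,T\}}{T}} \cdot \sqrt{\frac{1}{n}}
  \;=\;
  \begin{cases}
    \sqrt{1/n},               & S \ge T, \\[2pt]
    \sqrt{S/T}\,\sqrt{1/n},   & S < T.
  \end{cases}
% Illustration (assumed values): S = 10, T = 40 gives \sqrt{10/40} = 1/2,
% so the leading term is half of the \sqrt{1/n} rate.
```

Under this reading, the new bound matches the earlier rate when the context window covers the whole sequence and is strictly smaller whenever $S < T$.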

Paper Structure

This paper contains 35 sections, 23 theorems, 125 equations, 9 figures.

Key Result

Proposition 3.1

The Bayes-optimal predictor of the step-wise population risk defined in Eq. \ref{eqn:BO} is given by

Figures (9)

  • Figure 1: The Transformer behaves close to the Bayes-optimal predictor for in-distribution tasks. Details of the distributions used in data generation are given in Section \ref{app:train_data}. The numbers 4096 and 65536 refer to the number of tasks (configurations of $(w_i,\sigma_i)$) used in training, which is formally defined in Section \ref{app:pool_size}. The Bayes-optimal predictor is stated in Proposition \ref{prop:BO} and calculated analytically in Section \ref{app:BO_derive}. For the left panel, the $y$-axis gives the mean squared error in predicting $y_t$. For the right panel, the $y$-axis gives the average of the predicted uncertainty over all the test samples (average of $\hat{\sigma}(H_t)$ or $\sigma^*(H_t)$ on test samples). In particular, we note that ridge regression and linear regression (ordinary least squares) do not naturally produce a measure of uncertainty, so we use the sum of residuals on the in-context samples as their estimates of uncertainty. More visualizations are deferred to Section \ref{app:more_ID_figure}.
  • Figure 2: OOD performance of Transformers and the Bayes-optimal predictor. There are three OOD environments, small (S-OOD), medium (M-OOD), and large (L-OOD), that reflect the intensity of the OOD. The $y$-axis gives the average of the predicted uncertainty over all the test samples (average of $\hat{\sigma}(H_t)$ or $\sigma^*(H_t)$ on test samples), and ideally, it should converge to the expected uncertainty level of 0.5 (S-OOD), 2 (M-OOD), and 4 (L-OOD) as in-context samples increase. Two versions of the Transformer model are trained, with pool sizes of 4096 and 65536. The Transformers and the Bayes-optimal predictor are the same as the ones in Figure \ref{fig:ID_performance}. The only difference is that they are evaluated on OOD data here.
  • Figure 3: The effect of removing positional encoding on prompt length generalization. The $y$-axis records the average error of uncertainty prediction, which is the difference between the uncertainty predicted by the Transformer and that of the Bayes-optimal estimator. (a) For models trained with prompt lengths $\leq 44$, the left panel shows that the model with positional encoding generalizes worst to longer prompts, and that removing positional encoding effectively improves length generalization. (b) For models trained with prompt lengths $\geq 45$, removing positional encoding helps generalization to shorter prompts, although generalization to shorter lengths is generally weaker than generalization to longer lengths.
  • Figure 4: In-distribution performance of the uncertainty prediction against the Bayes-optimal predictor. The $y$-axis gives an estimate of $\mathbb{E}\left[-\log |\hat{\sigma}(H_t)-\sigma^*(H_t)|\right]$ where the expectation is taken with respect to $H_t$. Here $\hat{\sigma}(H_t)$ is the uncertainty estimate produced by an algorithm (ridge regression, linear regression, or Transformer), and $\sigma^*(H_t)$ is the Bayes-optimal predictor given in Proposition \ref{prop:BO} and calculated in Section \ref{app:BO_derive}. The figure shows that the Transformer and the Bayes-optimal predictor produce similar uncertainty predictions. In addition, the Transformer trained on a larger pool of tasks (larger $N$) produces a better approximation of the Bayes-optimal predictor.
  • Figure 5: Performance under the L-OOD setting. For both (a) and (b), the $y$-axis gives the average of the predicted uncertainty over all the test samples (average of $\hat{\sigma}(H_t)$ or $\sigma^*(H_t)$ on test samples), and ideally the curves should converge to the true uncertainty level of $4$ as the number of in-context samples increases. In (a), we compare the Bayes-optimal predictor that uses the wrong prior with the Bayes-optimal predictor that uses the correct prior (which replaces the in-distribution prior with the correct OOD prior of $\sigma^2$). Both work well in that the curves converge to the true mean uncertainty level of around $4$. The Transformers deviate from both Bayes-optimal predictors due to the large OOD intensity. In (b), we observe that as the training task diversity increases, the Transformer gradually moves from the ID reference line toward the Bayes-optimal predictor.
  • ...and 4 more figures
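
The Figure 1 caption notes that ridge regression and ordinary least squares do not output an uncertainty estimate, so a residual-based proxy on the in-context samples is used instead. The following is a minimal sketch of such baselines, not the authors' code. It assumes a standard normal prior on the weights and covariates (the paper's actual distributions are in its appendix and are not reproduced here), and it assumes the residuals are aggregated as a root-mean-square so that the proxy is on the scale of $\sigma$; the caption itself only says "sum of residuals". All function names are hypothetical.

```python
import numpy as np

def sample_prompt(rng, d, t, sigma):
    """Draw one linear-regression task and t in-context samples.
    Assumption: w and x are standard normal; the paper's priors may differ."""
    w = rng.standard_normal(d)
    X = rng.standard_normal((t, d))
    y = X @ w + sigma * rng.standard_normal(t)
    return X, y, w

def fit_ols(X, y):
    """Ordinary least squares (minimum-norm solution if X is rank deficient)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def fit_ridge(X, y, lam):
    """Ridge regression with penalty lam > 0."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def residual_uncertainty(X, y, w_hat):
    """Assumed proxy for sigma: root-mean-square residual on the in-context samples."""
    resid = y - X @ w_hat
    return float(np.sqrt(np.mean(resid ** 2)))

# Example usage with illustrative values (d, t, sigma are not from the paper).
rng = np.random.default_rng(0)
X, y, w = sample_prompt(rng, d=5, t=2000, sigma=2.0)
print("OLS uncertainty proxy:  ", residual_uncertainty(X, y, fit_ols(X, y)))
print("Ridge uncertainty proxy:", residual_uncertainty(X, y, fit_ridge(X, y, lam=1.0)))
```

With many in-context samples, both proxies should approach the task's noise level, which is the convergence behavior the figures track for the Transformer and the Bayes-optimal predictor. Neither baseline uses the prior over noise levels from the training distribution, which is the separation the uncertainty objective is meant to expose.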

Theorems & Definitions (48)

  • Definition 2.1: Bayes-optimal predictor
  • Proposition 3.1: Bayes-optimal predictor for mean and uncertainty prediction
  • Theorem 3.2
  • Definition B.1: Multi-Head Attention
  • Definition B.2: Multi-Layer Perceptron
  • Definition B.3: Transformer
  • Remark B.4
  • Example B.7
  • Example C.1: A corollary that can be derived based on Theorem 4.1 in zhang2023trained
  • proof
  • ...and 38 more
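
Proposition 3.1 concerns the Bayes-optimal predictor for joint mean and uncertainty prediction. As a rough illustration of why such a predictor recovers a conditional mean and a conditional spread, the sketch below assumes a Gaussian negative log-likelihood style loss, $\log \sigma + (y - \mu)^2 / (2\sigma^2)$. This loss is an assumption made for illustration; the paper's exact risk is not reproduced here. For a fixed sample, this loss is minimized over constants $(\mu, \sigma)$ by the sample mean and the sample standard deviation, which the code checks numerically.

```python
import numpy as np

def gaussian_nll(y, mean, sigma):
    """Average of log(sigma) + (y - mean)^2 / (2 sigma^2), additive constant dropped.
    Assumed loss for illustration, not necessarily the paper's exact objective."""
    return float(np.mean(np.log(sigma) + (y - mean) ** 2 / (2.0 * sigma ** 2)))

# Illustrative data: the values 1.5 and 0.7 are arbitrary choices.
rng = np.random.default_rng(0)
y = 1.5 + 0.7 * rng.standard_normal(10_000)

# Closed-form minimizers of the assumed loss over constant predictions.
mean_star = float(y.mean())
sigma_star = float(y.std())  # ddof=0, i.e. sqrt of the mean squared deviation

best = gaussian_nll(y, mean_star, sigma_star)

# Perturbing either output away from the minimizer increases the loss.
for d_mean, d_sigma in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    perturbed = gaussian_nll(y, mean_star + d_mean, sigma_star + d_sigma)
    print(f"shift mean {d_mean:+.1f}, sigma {d_sigma:+.1f}: loss gap = {perturbed - best:.5f}")
```

In the in-context setting the same argument would apply conditionally on the history $H_t$, so a model trained on such a loss has an incentive to match both the posterior mean and the posterior predictive spread implied by the training distribution.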