Optimal Attention Temperature Enhances In-Context Learning under Distribution Shift

Samet Demir, Zafer Dogan

TL;DR

This work addresses how to enhance in-context learning robustness of pretrained Transformers when test data deviate from pretraining distributions. By adopting a linearized softmax attention model, it derives closed-form generalization-error expressions and proves that an optimal attention temperature $\tau$ exists to minimize error under distribution shift. The authors validate the theory with synthetic linear-regression tasks and large language models (GPT-2 and LLaMA-2-7B), showing that adjusting $\tau$ can substantially improve ICL performance in practical deployments. Overall, the paper provides a principled framework and actionable guidance for selecting attention temperature to bolster ICL under real-world distribution shifts.

Abstract

Pretrained Transformers excel at in-context learning (ICL), inferring new tasks from only a handful of examples. Yet, their ICL performance can degrade sharply under distribution shift between pretraining and test data, a regime increasingly common in real-world deployments. While recent empirical work hints that adjusting the attention temperature in the softmax can enhance Transformer performance, the attention temperature's role in ICL under distribution shift remains unexplored. This paper provides the first theoretical and empirical study of attention temperature for ICL under distribution shift. Using a simplified but expressive "linearized softmax" framework, we derive closed-form generalization error expressions and prove that shifts in input covariance or label noise substantially impair ICL, but that an optimal attention temperature exists which minimizes this error. We then validate our predictions through extensive simulations on linear regression tasks and large-scale experiments with GPT-2 and LLaMA2-7B on question-answering benchmarks. Our results establish attention temperature as a principled and powerful mechanism for improving the robustness of ICL in pretrained Transformers, advancing theoretical understanding and providing actionable guidance for selecting attention temperature in practice.
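
To make the notion of attention temperature concrete, the following is a minimal sketch of single-query dot-product attention in which the pre-softmax scores are divided by a temperature. It follows the convention stated in the Figure 3 caption, where the temperature is written as $\tau\sqrt{d_k}$. The function names, the toy query, keys, and values, and the chosen values of `tau` are illustrative assumptions, not taken from the paper.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax of scores / temperature."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max before exponentiating for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values, tau=1.0):
    """Single-query dot-product attention with temperature tau * sqrt(d_k).

    tau = 1 recovers the usual 1 / sqrt(d_k) scaling.
    """
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores, temperature=tau * math.sqrt(d_k))
    dim_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return output, weights


# Toy inputs (hypothetical, for illustration only).
query = [1.0, 0.0]
keys = [[2.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]
values = [[1.0], [0.0], [-1.0]]

_, sharp = attention(query, keys, values, tau=0.25)  # low temperature
_, flat = attention(query, keys, values, tau=4.0)    # high temperature
```

A low temperature concentrates the attention weights on the best-matching key, while a high temperature pushes them toward uniform. This is the knob the paper proposes to tune under distribution shift.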

Paper Structure

This paper contains 44 sections, 5 theorems, 82 equations, 6 figures, 2 tables.

Key Result

Lemma 4.1

When the temperature parameter is set to $\tau = 1$ during pretraining, the following parameter configuration approximates the Bayes-optimal estimator in (eq:bayes_optimal_ridge_estimator): where $\hat{{\bm{X}}} \in \mathbb{R}^{ml \times d}$ is the centered input matrix formed from $ml$ samples of ${\bm{x}}$. This configuration aligns our model with Bayes-optimal ridge regression. The quantit

Figures (6)

  • Figure 1: Experiments with Transformer (\ref{eq:linearized_attention}) on ICL under distribution shifts. Parameters are set using (\ref{eq:pretraining}), while the optimal temperature is calculated by Theorem \ref{theorem:optimal_temperature}. Here, $d=50$, $m=5000$ (with a new task per sample), $\sigma = 0.1$, ${\bm{\mu}}_x^{train} = {\bm{\mu}}_w^{train} = \mathbf{0}$, and ${\bm{\Sigma}}_x^{train} = {\bm{\Sigma}}_w^{train} = {\bm{I}}$.
  • Figure 2: Effect of noise shift on Transformer (\ref{eq:linearized_attention}). The pretraining noise is $\sigma_{train} = 0.1$, while $\sigma_{test}$ varies across plots. The optimal temperature is set by Theorem \ref{theorem:optimal_temperature}. This setting matches Figure \ref{fig:linearized_attention_experiments}a, except for changes in test-time noise $\sigma_{test}$.
  • Figure 3: Effect of attention temperature on the ICL performance of LLaMA-2-7B \cite{touvron2023llama} on the SCIQ dataset \cite{welbl2017crowdsourcing}. Distribution shift is induced by injecting noisy yet “relevant” labels into in-context demonstrations, following \cite{gao2024on}. Panel (a) fixes the noisy ratio at 0.6; panel (b) fixes the number of in-context examples at 6. Results (averaged over 12 Monte Carlo runs) include error bars showing one standard deviation. The attention temperature of all layers is set to $\tau\sqrt{d_k}$ for dimension independence, where $d_k$ denotes the key dimension of the corresponding layer. The dashed black line marks the “optimal temperature” computed from the variance-to-mean ratio of pre-softmax scores, an insight derived from Theorem \ref{theorem:optimal_temperature}, as explained in Appendix \ref{appendix:insights_for_other_settings}. Full experimental details appear in Appendix \ref{appendix:experimental_details}.
  • Figure 4: Comparison of linear and linearized attention under a shift in input mean. The plot illustrates the impact of a test-time shift in input mean on the performance of linear attention and linearized attention. While linear attention degrades under the distribution shift and fails to recover the Bayes-optimal performance, linearized attention remains robust and asymptotically matches the Bayes-optimal predictor as the context length $l$ increases.
  • Figure 5: Comparison of temperature effects for the softmax, linearized softmax, and linear (with temperature scaling) cases. We consider an input vector ${\bm{x}} \in \mathbb{R}^l$ whose histogram is shown in the left-most plot. The remaining plots show histograms of the elements of $\text{softmax}({\bm{x}} / \tau)$, $\text{linearized\_softmax}({\bm{x}} /\tau)$ defined in (\ref{eq:linearized_softmax_definition}), and ${\bm{x}}/(l \tau)$, from left to right.
  • ...and 1 more figure
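
The three temperature-scaled maps compared in the Figure 5 caption can be sketched as follows. The paper's definition of $\text{linearized\_softmax}$ (eq:linearized_softmax_definition) is not reproduced in this summary, so the sketch substitutes a generic first-order Taylor expansion of the softmax around zero scores, $\text{softmax}({\bm{z}})_i \approx (1 + z_i - \bar{z})/l$. This is an assumption and may differ from the paper's definition. The input vector and the helper names are likewise hypothetical.

```python
import math


def softmax(x, tau):
    """Numerically stable softmax(x / tau)."""
    scaled = [xi / tau for xi in x]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def taylor_linearized_softmax(x, tau):
    """First-order Taylor expansion of softmax(x / tau) around zero scores.

    ASSUMPTION: an illustrative stand-in, not necessarily the paper's
    linearized softmax. Entries still sum to 1 but can turn negative
    when the scaled scores are large.
    """
    l = len(x)
    scaled = [xi / tau for xi in x]
    mean = sum(scaled) / l
    return [(1.0 + s - mean) / l for s in scaled]


def scaled_linear(x, tau):
    """Linear map with temperature scaling, x / (l * tau), with no normalization."""
    l = len(x)
    return [xi / (l * tau) for xi in x]


def max_gap(a, b):
    """Largest elementwise absolute difference between two vectors."""
    return max(abs(ai - bi) for ai, bi in zip(a, b))


x = [0.5, -0.2, 0.1, -0.4]  # hypothetical pre-softmax scores

# The expansion tracks the exact softmax more closely at high temperature.
gap_high_tau = max_gap(softmax(x, 10.0), taylor_linearized_softmax(x, 10.0))
gap_low_tau = max_gap(softmax(x, 0.5), taylor_linearized_softmax(x, 0.5))
linear_out = scaled_linear(x, 10.0)
```

Under this assumed expansion, the linearized map keeps the sum-to-one normalization of the softmax, whereas the purely linear map ${\bm{x}}/(l\tau)$ does not.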

Theorems & Definitions (13)

  • Definition 3.3: In-Context Learning (ICL)
  • Remark 3.4: Linear vs. linearized attention
  • Lemma 4.1: Pretrained Parameters
  • Remark 4.2
  • Remark 4.3
  • Corollary 4.4
  • Theorem 4.6: Generalization error for ICL
  • proof
  • Theorem 4.7: Optimal attention temperature
  • proof
  • ...and 3 more