From predictions to confidence intervals: an empirical study of conformal prediction methods for in-context learning

Zhe Huang, Simone Rossi, Rui Yuan, Thomas Hannagan

TL;DR

The paper tackles uncertainty quantification for in-context learning in transformers by combining conformal prediction (CP) with in-context learning (ICL) to produce distribution-free prediction intervals with guaranteed coverage $1-\alpha$ in a single forward pass. It uses a pre-trained linear self-attention transformer to compute conformity scores across a finite grid of candidate labels $z$, enabling exact CP guarantees without retraining on augmented datasets. The authors benchmark CP with ICL against ridge-based oracle CP and split CP, showing robust coverage and competitive compute times, even under distribution shifts, and they identify scaling laws that guide the choice of model size and data usage under a compute budget. The work provides a theoretically grounded, scalable framework that integrates ICL with CP for uncertainty quantification in transformer-based models, with potential extensions to autoregressive settings and real-world data.

Abstract

Transformers have become a standard architecture in machine learning, demonstrating strong in-context learning (ICL) abilities that allow them to learn from the prompt at inference time. However, uncertainty quantification for ICL remains an open challenge, particularly in noisy regression tasks. This paper investigates whether ICL can be leveraged for distribution-free uncertainty estimation, proposing a method based on conformal prediction to construct prediction intervals with guaranteed coverage. While traditional conformal methods are computationally expensive due to repeated model fitting, we exploit ICL to efficiently generate confidence intervals in a single forward pass. Our empirical analysis compares this approach against ridge regression-based conformal methods, showing that conformal prediction with in-context learning (CP with ICL) achieves robust and scalable uncertainty estimates. Additionally, we evaluate its performance under distribution shifts and establish scaling laws to guide model training. These findings bridge ICL and conformal prediction, providing a theoretically grounded and new framework for uncertainty quantification in transformer-based models.
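For reference, split conformal prediction with a ridge regressor, one of the baselines named in the TL;DR, can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the function name `split_conformal_interval`, the 50/50 split, the absolute-residual score, the regularization strength `lam`, and the synthetic data are all chosen here for illustration.

```python
import numpy as np


def split_conformal_interval(X, y, x_new, alpha=0.1, lam=1.0):
    """Split CP: fit ridge on one half, calibrate residuals on the other."""
    n, d = X.shape
    half = n // 2
    X_fit, y_fit = X[:half], y[:half]
    X_cal, y_cal = X[half:], y[half:]

    # Closed-form ridge solution on the fitting half only.
    w = np.linalg.solve(X_fit.T @ X_fit + lam * np.eye(d), X_fit.T @ y_fit)

    # Absolute residuals on the held-out calibration half (assumed score).
    scores = np.sort(np.abs(y_cal - X_cal @ w))
    m = len(scores)

    # The ceil((1 - alpha) * (m + 1))-th smallest score gives the half-width.
    k = int(np.ceil((1 - alpha) * (m + 1)))
    q = float(scores[k - 1]) if k <= m else float("inf")

    center = float(x_new @ w)
    return center - q, center + q


# Synthetic linear data (illustrative only).
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(100, 3))
y = X @ w_true + 0.1 * rng.normal(size=100)
x_new = np.array([0.5, -0.5, 1.0])  # noiseless target is 2.0

lo, hi = split_conformal_interval(X, y, x_new, alpha=0.1)
print(f"90% interval: [{lo:.3f}, {hi:.3f}]")
```

The model is fitted once, so the cost is a single ridge solve plus a sort. The price is that only half of the context is used for fitting, which is consistent with the higher variance at small context sizes reported for split CP in Figure 2.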

Paper Structure

This paper contains 26 sections, 2 theorems, 46 equations, 7 figures, 1 table.

Key Result

theorem 1

Assume that the data points $\{(\boldsymbol{x}_i, y_i)\}_{i=1}^{n}$ and the test point $(\boldsymbol{x}_{n+1}, y_{n+1})$ are exchangeable. Consider the full conformal prediction set $\Gamma_\alpha(\boldsymbol{x}_{n+1})$, constructed from conformity scores $R_i(z)$ that are computed for each candidate label $z$. Then the prediction set satisfies $\mathbb{P}\left(y_{n+1} \in \Gamma_\alpha(\boldsymbol{x}_{n+1})\right) \ge 1 - \alpha$, where the probability is over ...
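The construction in Theorem 1 can be illustrated with classical full conformal prediction over a grid of candidate labels. The sketch below is hedged: it refits a ridge regressor for every candidate $z$ (the repeated fitting that CP with ICL is meant to avoid), and the absolute-residual score, the p-value form of the set, and the names `ridge_fit_predict` and `full_conformal_set` are assumptions made here, not the paper's definitions.

```python
import numpy as np


def ridge_fit_predict(X, y, X_eval, lam=1.0):
    """Closed-form ridge fit on (X, y), then predictions for X_eval."""
    d = X.shape[1]
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return X_eval @ w


def full_conformal_set(X, y, x_new, z_grid, alpha=0.1, lam=1.0):
    """Return the candidate labels in z_grid kept by full conformal prediction."""
    X_aug = np.vstack([X, x_new])
    kept = []
    for z in z_grid:
        y_aug = np.append(y, z)
        # Refit on the augmented dataset that includes the pair (x_new, z).
        preds = ridge_fit_predict(X_aug, y_aug, X_aug, lam)
        scores = np.abs(y_aug - preds)  # assumed score R_i(z), i = 1..n+1
        # Conformal p-value: share of scores at least as large as the test score.
        p_value = np.mean(scores >= scores[-1])
        if p_value > alpha:
            kept.append(z)
    return np.array(kept)


# Synthetic linear data (illustrative only).
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(50, 3))
y = X @ w_true + 0.1 * rng.normal(size=50)
x_new = np.array([0.5, -0.5, 1.0])  # noiseless target is 2.0

z_grid = np.linspace(-10.0, 10.0, 401)
kept = full_conformal_set(X, y, x_new, z_grid, alpha=0.1)
print(f"kept {len(kept)} of {len(z_grid)} candidates")
```

Each candidate requires a fresh fit, so the cost grows linearly with the grid size. Keeping every $z$ whose p-value exceeds $\alpha$ is the standard way to obtain a set with marginal coverage of at least $1-\alpha$ under exchangeability.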

Figures (7)

  • Figure 1: In-context learning with conformal prediction. The pre-training process consists of generating task sequences and optimizing the model parameters $\boldsymbol{\theta}$ to minimize the mean squared error between the predicted and true labels. During inference, we use the pre-trained model to predict the label $y_{n+1}^{(\tau)}$ and compute the prediction interval using conformal prediction.
  • Figure 2: Test coverage as a function of context size. In (a) both CP with ICL and CP with ridge converge to the theoretical value of $1 - \alpha = 0.90$ as the context size increases, and the behavior of the two methods is similar. The comparison with split CP with ridge (b) shows that the latter has higher variance for smaller context sizes.
  • Figure 3: Coverage of CP with ICL across varying input dimensions (15, 30, and 50). Regardless of the input dimension, the test coverage converges to the theoretical value as the context size increases.
  • Figure 4: Wasserstein distance between predictive distributions as a function of $d/n$. Regardless of the input dimension $d$, the Wasserstein distance does not decrease monotonically with the ratio $d/n$ and peaks at $d/n=1$.
  • Figure 5: Inference on a different distribution from pre-training. In (a) and (b) we show the test coverage as a function of the input range and the weight scale, respectively. In (c) and (d) we show the Wasserstein distance between the predictive distributions as a function of the input range and the weight scale, respectively.
  • ...and 2 more figures
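
Figures 4 and 5 compare predictive distributions through a Wasserstein distance, but this summary does not say which variant is used or how it is estimated. For intuition only, the sketch below computes the order-1 Wasserstein distance between two one-dimensional empirical samples of equal size, where it reduces to the mean absolute difference between sorted values. The function name `wasserstein_1d` and the toy samples are assumptions.

```python
def wasserstein_1d(samples_a, samples_b):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples."""
    if len(samples_a) != len(samples_b) or len(samples_a) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    # With equal sizes, the optimal coupling matches sorted values in order.
    a = sorted(samples_a)
    b = sorted(samples_b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Second sample is the first one shifted by 0.5, in shuffled order.
shifted = wasserstein_1d([0.0, 1.0, 2.0], [2.5, 0.5, 1.5])
identical = wasserstein_1d([0.0, 1.0, 2.0], [2.0, 0.0, 1.0])
print(shifted, identical)  # 0.5 0.0
```

For samples of unequal size or with weights, a library routine such as SciPy's `scipy.stats.wasserstein_distance` covers the general one-dimensional case.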

Theorems & Definitions (8)

  • theorem 1: Marginal Coverage Guarantee
  • proof
  • remark 1
  • theorem 2: Coverage Guarantee with Symmetric Conformity Score
  • proof
  • remark 2
  • remark 3
  • remark 4