Technical Debt in In-Context Learning: Diminishing Efficiency in Long Context

Taejong Joo, Diego Klabjan

TL;DR

This work evaluates how close in-context learning (ICL) in Transformers comes to the Bayes optimal estimator within a meta-ICL framework where target functions are drawn from a hierarchical prior. It introduces a stylized, information-theoretic benchmarking scheme that compares ICL against Bayesian model averaging (BMA) and other principled baselines, finding near-optimal performance in few-shot settings but a pronounced, intrinsic inefficiency in many-shot regimes caused by non-vanishing excess risk. The authors decompose the ICL error into an irreducible Bayes risk and an excess risk, and prove a suboptimality bound SubOpt(q) that grows as the performance requirement becomes more demanding. They further show that scaling the context length or the model size reduces the excess risk but does not eliminate it, suggesting that ICL's update-free adaptability carries a fundamental technical debt. The results motivate on-the-fly adaptive methods that preserve update-free inference while improving sample efficiency in long contexts.
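
The decomposition mentioned above can be written out for squared loss. The following is a sketch under assumed notation: the symbols $\hat{f}$ (the ICL predictor), $f^\star$ (the posterior-mean predictor), and $D_t$ (the first $t$ demonstrations) are illustrative and may differ from the paper's own.

```latex
\begin{aligned}
\underbrace{\mathbb{E}\big[(y - \hat{f}(x; D_t))^2\big]}_{\text{ICL risk}}
&= \underbrace{\mathbb{E}\big[(y - f^\star(x; D_t))^2\big]}_{\text{Bayes risk}}
 + \underbrace{\mathbb{E}\big[(f^\star(x; D_t) - \hat{f}(x; D_t))^2\big]}_{\text{excess risk}},
\\
f^\star(x; D_t) &= \mathbb{E}\big[\, y \mid x, D_t \,\big].
\end{aligned}
```

The cross term vanishes because $\mathbb{E}[y - f^\star(x; D_t) \mid x, D_t] = 0$ while $f^\star - \hat{f}$ depends only on $(x, D_t)$. In this notation, "non-vanishing excess risk" means the second term stays bounded away from zero as $t$ grows, even though the Bayes risk keeps shrinking toward the noise level.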

Abstract

Transformers have demonstrated remarkable in-context learning (ICL) capabilities, adapting to new tasks by simply conditioning on demonstrations without parameter updates. Compelling empirical and theoretical evidence suggests that ICL, as a general-purpose learner, could outperform task-specific models. However, it remains unclear to what extent the transformers optimally learn in-context compared to principled learning algorithms. To investigate this, we employ a meta ICL framework in which each prompt defines a distinctive regression task whose target function is drawn from a hierarchical distribution, requiring inference over both the latent model class and task-specific parameters. Within this setup, we benchmark sample complexity of ICL against principled learning algorithms, including the Bayes optimal estimator, under varying performance requirements. Our findings reveal a striking dichotomy: while ICL initially matches the efficiency of a Bayes optimal estimator, its efficiency significantly deteriorates in long context. Through an information-theoretic analysis, we show that the diminishing efficiency is inherent to ICL. These results clarify the trade-offs in adopting ICL as a universal problem solver, motivating a new generation of on-the-fly adaptive methods without the diminishing efficiency.
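
To make the hierarchical setup concrete, here is a minimal Python sketch of a Bayesian model averaging baseline. It assumes one plausible instantiation that the text above does not spell out: the latent model class is the number $m \in \{1, \dots, M\}$ of active input features, weights are drawn as $w \sim \mathcal{N}(0, \sigma_w^2 I_m)$, and labels are $y = w^\top x_{1:m} + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma_y^2)$. The function names and the default values (borrowed from the setting quoted in the Figure 4 caption) are illustrative, not the authors' code.

```python
import numpy as np

def sample_demos(rng, w, t, M=10, sigma_y2=0.03):
    """Draw t demonstrations from a task whose first len(w) features are active (assumed setup)."""
    m = len(w)
    X = rng.normal(size=(t, M))
    y = X[:, :m] @ w + rng.normal(0.0, np.sqrt(sigma_y2), size=t)
    return X, y

def log_evidence(X_m, y, sigma_y2, sigma_w2):
    """Log marginal likelihood of y under model class m: y ~ N(0, sigma_w2 X_m X_m^T + sigma_y2 I)."""
    t = len(y)
    K = sigma_w2 * X_m @ X_m.T + sigma_y2 * np.eye(t)
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (y @ np.linalg.solve(K, y) + logdet + t * np.log(2 * np.pi))

def bma_predict(X, y, x_new, M=10, sigma_y2=0.03, sigma_w2=10.0):
    """Posterior-mean prediction averaged over model classes (uniform prior over m, t >= 1)."""
    log_ev = np.empty(M)
    preds = np.empty(M)
    for m in range(1, M + 1):
        X_m = X[:, :m]
        log_ev[m - 1] = log_evidence(X_m, y, sigma_y2, sigma_w2)
        # Ridge-form posterior mean of the weights given model class m.
        A = X_m.T @ X_m + (sigma_y2 / sigma_w2) * np.eye(m)
        w_post = np.linalg.solve(A, X_m.T @ y)
        preds[m - 1] = x_new[:m] @ w_post
    # Normalize the posterior over model classes in a numerically stable way.
    post = np.exp(log_ev - log_ev.max())
    post /= post.sum()
    return float(post @ preds), post
```

Calling `bma_predict(X, y, x_new)` on demonstrations from `sample_demos` returns the averaged prediction together with the posterior over model classes. Under these Gaussian assumptions the averaged prediction is the posterior mean of $y$, which is the kind of reference estimator the benchmark compares ICL against.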

Paper Structure

This paper contains 26 sections, 2 theorems, 20 equations, 10 figures.

Key Result

Theorem 4.2

Suppose $(\bar{t}, \triangle_{\text{XS}})$ satisfies Assumption \ref{assumption:bounded_appx}. For a sufficiently small $q$ such that ${\mathsfit{N}}_{\textsf{BMA}}(q) \geq \bar{t}$, it holds that where $\tilde{D}_{t+1}$ is a sample from the same distribution as $D_{t+1}$.

Figures (10)

  • Figure 1: Mean performance ratio of ICL against BMA across different performance requirements. The shaded areas represent the standard deviation of the corresponding performance ratio.
  • Figure 2: Performance profiles $\rho_b$ across different performance ratios $\tau$ under different target performance quantiles ${\mathcal{Q}}$. Each curve represents the probability that a method achieves the desired performance within a factor $\tau$ of the best method's sample complexity (x-axes). Figure \ref{results:full_perf_prof} in the Appendix illustrates results for all ${\mathcal{Q}}$.
  • Figure 3: (a) Mean squared errors for different demonstration sizes. (b) Squared prediction differences between BMA and other methods for different demonstration sizes. Figures \ref{results:full_mse} and \ref{results:full_spd} in the Appendix illustrate results for all $s \in {\mathcal{S}}$ and $b \in {\mathcal{B}}$.
  • Figure 4: Impacts of the pretraining prompt length (left) and the model size (right) on the excess risk curve in ${\mathcal{E}} = ([M], \sigma_y^2, \sigma_w^2) = ([10], 0.03, 10)$.
  • Figure A1: The excess risk curve in ${\mathcal{E}} = ([M], \sigma_y^2, \sigma_w^2) = ([10], 0.03, 10)$ under different pretraining prompt length for Mamba and LSTM.
  • ...and 5 more figures
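
The performance profile described in the Figure 2 caption can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' evaluation code: a "performance requirement" is taken to be an MSE threshold `q`, `mse_curve[i]` is taken to be the error after `i + 1` demonstrations, and the helper names are hypothetical.

```python
import numpy as np

def sample_complexity(mse_curve, q):
    """Smallest number of demonstrations whose MSE is <= q; inf if q is never reached."""
    hits = np.flatnonzero(np.asarray(mse_curve, dtype=float) <= q)
    return float(hits[0] + 1) if hits.size else np.inf

def performance_profile(N, taus):
    """N[p, b]: sample complexity of method b on problem p.

    Returns rho[k, b]: fraction of problems on which method b needs at most
    taus[k] times the sample complexity of the best method on that problem.
    """
    N = np.asarray(N, dtype=float)
    taus = np.asarray(taus, dtype=float)
    best = N.min(axis=1, keepdims=True)
    with np.errstate(invalid="ignore"):
        # A method that never reaches the requirement gets an infinite ratio.
        ratios = np.where(np.isfinite(N), N / best, np.inf)
    return (ratios[None, :, :] <= taus[:, None, None]).mean(axis=1)
```

For example, with `N = [[10, 20], [30, 30], [5, inf]]` and `taus = [1, 2]`, the first method scores 1.0 at both ratios, while the second scores 1/3 at $\tau = 1$ and 2/3 at $\tau = 2$. A curve that reaches 1 quickly indicates a method whose sample complexity is close to the best on nearly every problem.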

Theorems & Definitions (7)

  • Definition 2.1
  • Definition 2.2
  • Definition 2.3
  • Theorem 4.2
  • Theorem 4.3
  • proof
  • proof