Technical Debt in In-Context Learning: Diminishing Efficiency in Long Context
Taejong Joo, Diego Klabjan
TL;DR
This work evaluates how close in-context learning (ICL) in Transformers comes to the Bayes optimal estimator within a meta-ICL framework in which target functions are drawn from a hierarchical prior. It introduces a stylized, information-theoretic benchmarking scheme that compares ICL with Bayesian model averaging (BMA) and other principled baselines. The comparison reveals near-optimal performance in few-shot settings but a pronounced, intrinsic inefficiency in many-shot regimes, caused by a non-vanishing excess risk. The authors decompose the ICL error into the irreducible Bayes risk and an excess risk, and prove a suboptimality bound SubOpt(q) that grows as the performance requirement becomes more demanding. They further show that scaling the context length or the model size reduces but does not eliminate the excess risk, suggesting that ICL's update-free adaptability incurs a fundamental technical debt. The results motivate on-the-fly adaptive methods that preserve update-free inference while improving sample efficiency in long contexts.
Abstract
Transformers have demonstrated remarkable in-context learning (ICL) capabilities, adapting to new tasks by simply conditioning on demonstrations without parameter updates. Compelling empirical and theoretical evidence suggests that ICL, as a general-purpose learner, could outperform task-specific models. However, it remains unclear how close transformers come to learning optimally in context when compared with principled learning algorithms. To investigate this, we employ a meta-ICL framework in which each prompt defines a distinct regression task whose target function is drawn from a hierarchical distribution, requiring inference over both the latent model class and the task-specific parameters. Within this setup, we benchmark the sample complexity of ICL against principled learning algorithms, including the Bayes optimal estimator, under varying performance requirements. Our findings reveal a striking dichotomy: while ICL initially matches the efficiency of the Bayes optimal estimator, its efficiency deteriorates significantly in long contexts. Through an information-theoretic analysis, we show that this diminishing efficiency is inherent to ICL. These results clarify the trade-offs in adopting ICL as a universal problem solver and motivate a new generation of on-the-fly adaptive methods that do not suffer from diminishing efficiency.
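To make the benchmarking idea concrete, the following is a minimal sketch of Bayesian model averaging under a hierarchical prior, written for a toy setting that is an assumption of this example and not the paper's experimental configuration. Each task first draws a model class (here, the number of active features, taken from the hypothetical tuple `DIMS`), then draws Gaussian weights for that class, and labels are corrupted by Gaussian noise with the assumed variance `NOISE_VAR`. The BMA predictor weights each class-conditional posterior mean by the posterior probability of its class. A ridge estimator that always uses every feature stands in for a suboptimal learner; it is not a Transformer and says nothing about how ICL itself behaves.

```python
import numpy as np

DIMS = (2, 8)      # hypothetical model classes: number of active features
NOISE_VAR = 0.25   # assumed label-noise variance


def log_evidence(X, y, noise_var=NOISE_VAR):
    """Log marginal likelihood of y when w ~ N(0, I): y ~ N(0, X X^T + noise_var I)."""
    n = X.shape[0]
    cov = X @ X.T + noise_var * np.eye(n)
    _, logdet = np.linalg.slogdet(cov)
    quad = y @ np.linalg.solve(cov, y)
    return -0.5 * (quad + logdet + n * np.log(2.0 * np.pi))


def posterior_mean_prediction(X, y, x_query, noise_var=NOISE_VAR):
    """Prediction from the Gaussian posterior mean of w (equivalent to ridge regression)."""
    d = X.shape[1]
    w_mean = np.linalg.solve(X.T @ X + noise_var * np.eye(d), X.T @ y)
    return float(w_mean @ x_query)


def bma_predict(X, y, x_query, dims=DIMS, noise_var=NOISE_VAR):
    """Average class-conditional predictions, weighted by class posteriors (uniform class prior)."""
    log_ev = np.array([log_evidence(X[:, :d], y, noise_var) for d in dims])
    weights = np.exp(log_ev - log_ev.max())  # subtract the max for numerical stability
    weights /= weights.sum()
    preds = np.array(
        [posterior_mean_prediction(X[:, :d], y, x_query[:d], noise_var) for d in dims]
    )
    return float(weights @ preds), weights


def sample_task(rng, n_context, dims=DIMS, noise_var=NOISE_VAR):
    """Draw a model class, then its weights, then a context set and one query point."""
    d_max = max(dims)
    d = dims[rng.integers(len(dims))]
    w = np.zeros(d_max)
    w[:d] = rng.standard_normal(d)
    X = rng.standard_normal((n_context, d_max))
    y = X @ w + np.sqrt(noise_var) * rng.standard_normal(n_context)
    x_query = rng.standard_normal(d_max)
    return X, y, x_query, float(w @ x_query)


def estimate_risks(n_context, n_trials=1000, seed=0):
    """Monte Carlo estimates of the BMA risk, the baseline risk, and the baseline's excess risk."""
    rng = np.random.default_rng(seed)
    err_bma, err_ridge, gap = [], [], []
    for _ in range(n_trials):
        X, y, x_query, target = sample_task(rng, n_context)
        f_bma, _ = bma_predict(X, y, x_query)
        f_ridge = posterior_mean_prediction(X, y, x_query)  # ignores the class structure
        err_bma.append((f_bma - target) ** 2)
        err_ridge.append((f_ridge - target) ** 2)
        gap.append((f_ridge - f_bma) ** 2)  # squared distance to the Bayes predictor
    return float(np.mean(err_bma)), float(np.mean(err_ridge)), float(np.mean(gap))


if __name__ == "__main__":
    for n in (4, 16, 64):
        risk_bma, risk_ridge, excess = estimate_risks(n)
        print(f"n={n:3d}  BMA risk={risk_bma:.3f}  ridge risk={risk_ridge:.3f}  excess={excess:.3f}")
```

Under squared loss, the risk of any predictor equals the Bayes risk plus its expected squared distance to the Bayes predictor, so the printed ridge risk should roughly match the sum of the other two columns, up to Monte Carlo noise. Sample complexity under a given performance requirement can then be read off as the smallest context length at which a learner's risk falls below the required level.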
