Scaling Laws and In-Context Learning: A Unified Theoretical Framework
Sushant Mehta, Ishan Gupta
TL;DR
This work tackles the mystery of when in-context learning (ICL) emerges in large transformers by proposing a unified theory that ties ICL performance to scaling laws. It treats ICL as a gradient-descent-like process implemented in the forward pass, yielding a power-law dependence on model size, data, and context, with exponents determined by task structure: $\alpha = \frac{1}{2(h+1)}$ for hierarchy depth $h$. The authors derive a critical scale $N_c \propto (kh)^{2(h+1)}$ signaling phase transitions to effective ICL, and identify optimal depth-width allocations under a fixed parameter budget: $L^* \propto N^{2/3}$ and $d^* \propto N^{1/3}$. Systematic experiments on synthetic tasks validate the theory with exponents matching predictions within a few percent, demonstrating both necessary and sufficient conditions for emergent ICL and revealing fundamental limits on what transformers can learn in-context. The framework provides practical design guidance for allocating depth and width to promote ICL and offers a principled lens to understand the computational boundaries of emergent capabilities in large-scale models.
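To make the two closed-form expressions above concrete, the following is a minimal Python sketch that evaluates the scaling exponent and the critical scale for a given hierarchy depth $h$ and context length $k$. The function names and the proportionality constant `c` are assumptions introduced for illustration; the summary states the critical scale only up to proportionality, so absolute values from this sketch are meaningful only as relative comparisons.

```python
def icl_exponent(h: int) -> float:
    """Scaling exponent alpha = 1 / (2 * (h + 1)) for hierarchy depth h."""
    if h < 0:
        raise ValueError("hierarchy depth h must be non-negative")
    return 1.0 / (2 * (h + 1))


def critical_scale(k: int, h: int, c: float = 1.0) -> float:
    """Critical scale N_c = c * (k * h) ** (2 * (h + 1)).

    The constant c is hypothetical: the source gives N_c only up to
    proportionality, so c = 1.0 is a placeholder, not a fitted value.
    """
    if h < 0 or k < 0:
        raise ValueError("k and h must be non-negative")
    return c * (k * h) ** (2 * (h + 1))


if __name__ == "__main__":
    # Compare how the exponent shrinks and the critical scale grows with h.
    for h in (1, 2, 3):
        print(h, icl_exponent(h), critical_scale(k=8, h=h))
```

Under these assumptions, deeper task hierarchies give a smaller exponent (slower power-law improvement) and a much larger critical scale, which is the qualitative trend the formulas imply.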
Abstract
In-context learning (ICL) enables large language models to adapt to new tasks from demonstrations without parameter updates. Despite extensive empirical studies, a principled understanding of ICL emergence at scale remains elusive. We present a unified theoretical framework connecting scaling laws to ICL emergence in transformers. Our analysis establishes that ICL performance follows power-law relationships with model depth $L$, width $d$, context length $k$, and training data $D$, with exponents determined by task structure. We show that under specific conditions, transformers implement gradient-based meta-learning in their forward pass, with an effective learning rate $\eta_{\text{eff}} = \Theta(1/\sqrt{Ld})$. We demonstrate sharp phase transitions at critical scales and derive optimal depth-width allocations favoring $L^* \propto N^{2/3}$, $d^* \propto N^{1/3}$ for a fixed parameter budget $N = Ld$. Systematic experiments on synthetic tasks validate our predictions, with measured scaling exponents closely matching theory. This work provides both necessary and sufficient conditions for the emergence of ICL and establishes fundamental computational limits on what transformers can learn in-context.
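As a rough illustration of the depth-width allocation and effective learning rate stated in the abstract, the sketch below splits a budget $N = Ld$ according to $L^* \propto N^{2/3}$ and $d^* \propto N^{1/3}$. The function names and the constants `c` are assumptions: the abstract gives these relations only up to proportionality (and the learning rate only up to $\Theta(\cdot)$), so the constants are set to 1 as placeholders. The width is divided by the same constant so that the product $L^* d^*$ stays equal to $N$.

```python
import math


def allocate_depth_width(n_budget: float, c: float = 1.0) -> tuple[float, float]:
    """Split a budget N = L * d as L* = c * N**(2/3), d* = N**(1/3) / c.

    c is a hypothetical proportionality constant (not given in the source).
    Values are returned as floats; rounding to integers is left to the caller.
    """
    if n_budget <= 0 or c <= 0:
        raise ValueError("n_budget and c must be positive")
    depth = c * n_budget ** (2.0 / 3.0)
    width = n_budget ** (1.0 / 3.0) / c
    return depth, width


def effective_learning_rate(depth: float, width: float, c: float = 1.0) -> float:
    """eta_eff = c / sqrt(L * d); c stands in for the constant hidden by Theta."""
    if depth <= 0 or width <= 0:
        raise ValueError("depth and width must be positive")
    return c / math.sqrt(depth * width)


if __name__ == "__main__":
    depth, width = allocate_depth_width(1e6)
    print(depth, width, effective_learning_rate(depth, width))
```

With the placeholder constants, a budget of $N = 10^6$ gives a depth near $10^4$ and a width near $10^2$, so the allocation is strongly depth-heavy, and the effective learning rate depends on the split only through the product $Ld = N$.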
