Demonstrations, CoT, and Prompting: A Theoretical Analysis of ICL

Xuhan Tong, Yuchen Zeng, Jiawei Zhang

Abstract

In-Context Learning (ICL) enables pretrained LLMs to adapt to downstream tasks by conditioning on a small set of input-output demonstrations, without any parameter updates. Although there have been many theoretical efforts to explain how ICL works, most either rely on strong architectural or data assumptions, or fail to capture the impact of key practical factors such as demonstration selection, Chain-of-Thought (CoT) prompting, the number of demonstrations, and prompt templates. We address this gap by establishing a theoretical analysis of ICL under mild assumptions that links these design choices to generalization behavior. We derive an upper bound on the ICL test loss, showing that performance is governed by (i) the quality of selected demonstrations, quantified by Lipschitz constants of the ICL loss along paths connecting test prompts to pretraining samples, (ii) an intrinsic ICL capability of the pretrained model, and (iii) the degree of distribution shift. Within the same framework, we analyze CoT prompting as inducing a task decomposition and show that it is beneficial when demonstrations are well chosen at each substep and the resulting subtasks are easier to learn. Finally, we characterize how ICL performance sensitivity to prompt templates varies with the number of demonstrations. Together, our study shows that pretraining equips the model with the ability to generalize beyond observed tasks, while CoT enables the model to compose simpler subtasks into more complex ones, and demonstrations and instructions enable it to retrieve similar or complex tasks, including those that can be composed into more complex ones, jointly supporting generalization to unseen tasks. All theoretical insights are corroborated by experiments.

Paper Structure (72 sections, 21 theorems, 157 equations, 6 figures, 3 tables)

Key Result

Lemma 2.1

If $J \subset \mathbb{R}$ is a finite interval and $E \subset J$ is an arbitrary measurable set, then for any polynomial $p$ of degree $n$,

Figures (6)

  • Figure 1: Empirical validation of (a) \ref{thm:lipschitz}, (b) \ref{thm:cot_bound}, and (c) \ref{thm:decay_1_5} and \ref{thm:decay_6}, respectively.
  • Figure 2: Effect of structured distractors introduced during pretraining on CoT performance. Shakespeare less/mid/more denote progressively larger proportions of Shakespearean corpus mixed into the pretraining data. At evaluation time, the model is prompted to solve addition problems using CoT. Despite substantial irrelevant context, the model retains strong CoT performance at inference, indicating an ability to suppress task-irrelevant information.
  • Figure 3: Posterior confidence under instruction variation. (Qwen-235B-A22B-Instruct) (a) Equivalent instructions yield stable, high-confidence predictions. (b) Incorrect but consistent instructions allow confidence to recover as more demonstrations are added. (c) Incorrect and inconsistent instructions lead to persistently low and unstable confidence despite increasing context length.
  • Figure 4: Posterior confidence under instruction variation. (Qwen-30B-A3B-Instruct) (a) Equivalent instructions yield stable, high-confidence predictions. (b) Incorrect but consistent instructions allow confidence to recover as more demonstrations are added. (c) Incorrect and inconsistent instructions lead to persistently low and unstable confidence despite increasing context length.
  • Figure 5: Discrete gradients of log probability. (Qwen-235B-A22B-Instruct) (a) Equivalent instructions yield small and stable gradients. (b) Incorrect but consistent instructions show gradual stabilization. (c) Incorrect and inconsistent instructions produce large and irregular gradients.
  • ...and 1 more figure

Theorems & Definitions (38)

  • Lemma 2.1: Remez Inequality for Polynomials (remez1936)
  • Lemma 2.2: Bernstein Approximation Theorem, Lipschitz Case (Lorentz1986)
  • Theorem 2.3: (Informal) ICL Generalization Bound
  • Proof: Proof Sketch
  • Remark 2.4
  • Remark 2.5
  • Theorem 3.1: (Informal) ICL Generalization Bound with CoT Prompting
  • Remark 3.2
  • Theorem 4.3: Exponential Convergence of Posterior Predictive (Formats 1–5)
  • Corollary 4.4: Gradient Decay of Prompt Sensitivity
  • ...and 28 more