Towards Auto-Regressive Next-Token Prediction: In-Context Learning Emerges from Generalization

Zixuan Gong, Xiaolin Hu, Huayi Tang, Yong Liu

TL;DR

The paper tackles why in-context learning (ICL) emerges in large language models under auto-regressive next-token prediction (AR-NTP) where prompt tokens are interdependent. It builds a two-level PAC-Bayesian generalization framework that couples pre-training data-topic distributions with ICL prompts, using ghost sequences to manage autoregressive dependencies and data-dependent priors to bound KL terms. The main contributions are new data-dependent, topic-dependent, and optimization-dependent bounds for pre-trained LLMs and experimental validation on linear dynamic systems, synthetic GINC data, and real language tasks. The results show that ICL arises from the generalization of both sequences and topics, informing practical guidelines for pre-training data scale, prompt length, and prior initialization.

Abstract

Large language models (LLMs) have demonstrated remarkable in-context learning (ICL) abilities. However, existing theoretical analyses of ICL have two main limitations: (a) Limited i.i.d. setting. Most studies focus on supervised function learning tasks where prompts are constructed from i.i.d. input-label pairs. This i.i.d. assumption diverges significantly from real language learning scenarios, where prompt tokens are interdependent. (b) Lack of emergence explanation. Most of the literature answers what ICL does from an implicit optimization perspective but falls short of explaining how ICL emerges and how the pre-training phase affects ICL. To address (a), we adopt a more practical paradigm, auto-regressive next-token prediction (AR-NTP), which closely aligns with the actual training of language models. Specifically, within AR-NTP, we emphasize prompt token-dependency: each subsequent token is predicted from the preceding sequence. To address (b), we formalize a systematic pre-training and ICL framework, highlighting the layer-wise structure of sequences and topics, alongside a two-level expectation. We then present data-dependent, topic-dependent and optimization-dependent PAC-Bayesian generalization bounds for pre-trained LLMs, showing that ICL emerges from the generalization of sequences and topics. Our theory is supported by experiments on numerical linear dynamic systems, synthetic GINC and real-world language datasets.
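
To make the AR-NTP objective and the sequence/topic layering concrete, here is a minimal sketch of an empirical next-token loss averaged first over the sequences of a topic and then over topics. It is an illustration under stated assumptions, not the paper's implementation: the `NextTokenModel` interface, the function names, the toy uniform model, and the choice to start predicting from the second token are all hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical model interface (an assumption for this sketch): maps a
# prefix of token ids to a probability distribution over the next token.
NextTokenModel = Callable[[Sequence[int]], Dict[int, float]]


def sequence_nll(model: NextTokenModel, tokens: Sequence[int]) -> float:
    """Average negative log-likelihood of each token given its full prefix.

    Assumes the sequence has at least two tokens.
    """
    losses = []
    for t in range(1, len(tokens)):
        probs = model(tokens[:t])                 # condition on the whole prefix
        p = probs.get(tokens[t], 0.0)
        losses.append(-math.log(max(p, 1e-12)))   # clamp to avoid log(0)
    return sum(losses) / len(losses)


def empirical_loss(model: NextTokenModel,
                   corpus: List[List[Sequence[int]]]) -> float:
    """corpus[k][n] is the n-th sequence of the k-th pre-training topic.

    Averages per-sequence losses within each topic, then across topics.
    """
    topic_losses = []
    for topic_sequences in corpus:
        per_seq = [sequence_nll(model, s) for s in topic_sequences]
        topic_losses.append(sum(per_seq) / len(per_seq))
    return sum(topic_losses) / len(topic_losses)


def uniform_model(vocab_size: int) -> NextTokenModel:
    """Toy baseline that ignores the prefix (for illustration only)."""
    dist = {v: 1.0 / vocab_size for v in range(vocab_size)}
    return lambda prefix: dist


corpus = [
    [[0, 1, 2, 3], [3, 2, 1, 0]],   # topic 0: two sequences
    [[1, 1, 2, 2], [0, 3, 0, 3]],   # topic 1: two sequences
]
loss = empirical_loss(uniform_model(4), corpus)
print(round(loss, 4))  # 1.3863, i.e. log(4) for a uniform guess over 4 tokens
```

Because each token is scored against a prediction conditioned on its whole prefix, the per-token losses within a sequence are not independent, which is the dependency the i.i.d. analyses mentioned above do not cover.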

Paper Structure

This paper contains 50 sections, 19 theorems, 130 equations, 7 figures, 2 tables.

Key Result

Theorem 4.3

Let the auto-regressive LLM $\mathbb{P}_\theta$ be the empirical solution of Equation eq-L-E, and let $\mathbb{P}(\cdot\mid w)$ denote the true data distribution under topic $w$. Under Assumptions ass:B and ass:lipschitz, for any $0<\delta < 1$, with probability at least $1-\delta$, the first-level ex then considering data-dependent prior $\nu_J$ and detailing the term $D_{\mathrm{KL}}(\mu\parallel\
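
For orientation only, one common textbook form of a PAC-Bayesian bound (McAllester-style, for i.i.d. samples and a loss bounded in $[0,1]$) is shown below. It is not the statement of Theorem 4.3, which is truncated above and which handles dependent tokens and a data-dependent prior. The symbols $n$ (sample size), $L$ (population loss) and $\hat{L}_S$ (empirical loss on sample $S$) are assumptions introduced for this illustration; $\mu$ and $\nu$ play the roles of posterior and prior as in the theorem text. With probability at least $1-\delta$ over the draw of $S$, simultaneously for all posteriors $\mu$:

$$
\mathbb{E}_{\theta\sim\mu}\big[L(\theta)\big] \;\le\; \mathbb{E}_{\theta\sim\mu}\big[\hat{L}_S(\theta)\big] + \sqrt{\frac{D_{\mathrm{KL}}(\mu\parallel\nu) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}
$$

The KL term is where the choice of prior matters: a prior $\nu$ closer to the learned posterior $\mu$ gives a smaller $D_{\mathrm{KL}}(\mu\parallel\nu)$ and hence a tighter bound, which is the motivation for data-dependent priors such as the $\nu_J$ mentioned above.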

Figures (7)

  • Figure 1: Overview of Pre-training and In-context Learning Framework.
  • Figure 2: Experiments on GINC and Real-world Language Datasets.
  • Figure 3: Overview of Two-Level Expectation. From a horizontal perspective: The first box (from top to bottom): according to Equation \ref{app-eq-L-decompose}, the population loss is decomposed into four parts. We ultimately obtain the upper bound of the population loss by separately defining the upper bound for each part. Combining Part $\text{I}$, Part $\text{II}$ and Part $\text{III}$, we obtain Theorem \ref{pre-gen-data-dependent}; further combining with Part $\text{IV}$, we obtain Theorem \ref{ICL-gen-topic-dependent}. The second box: comparing $L(\theta)$ and $L(\theta,\mathcal{W}_{\text{pre}})$, we aim to describe the second-level expectation defined over topic. The third box: comparing $L(\theta,w_k)$ and $L^\prime_{x}(\theta,w_k)$, we aim to describe the complete first-level expectation defined over sequence. The fourth box: comparing $L^\prime_{x}(\theta,w_k)$ and $L_{x}(\theta,w_k)$, $L^\prime_{E^{k,n}}(\theta,w_k)$ is a partial first-level expectation over token $x^{k,n}_{t+1}$ conditioned on $E^{k,n}_t$. The fifth box: negative log-likelihood loss, the optimization objective for a token. From a vertical perspective: the formulas described in the four columns can be found in Equations \ref{eq-L-2}, \ref{eq-L-W}, \ref{eq-L-E-prime} and \ref{eq-L-E-complete}, respectively. The first column: the chain of $L(\theta) \rightarrow L(\theta,w_k) \rightarrow L^\prime_{x}(\theta,w_k) \rightarrow L_{x}(\theta,w_k)$. The second column: the chain of $L(\theta,\mathcal{W}_{\text{pre}}) \rightarrow L(\theta,w_k) \rightarrow L^\prime_{x}(\theta,w_k) \rightarrow L_{x}(\theta,w_k)$. The third column: the chain of $L^\prime(\theta,\mathcal{W}_{\text{pre}}) \rightarrow L^\prime_{E^{k}}(\theta,w_k) \rightarrow L^\prime_{x}(\theta,w_k) \rightarrow L_{x}(\theta,w_k)$. The fourth column: the chain of $L_E(\theta,\mathcal{W}_{\text{pre}}) \rightarrow L_{E^k}(\theta,w_k) \rightarrow L_{x}(\theta,w_k)$.
  • Figure 4: Experiments on Linear Dynamic System. Left: The comparison of overall loss and in-context learning loss. Right: The comparison of experiments conducted on complete topic set and subset of topics.
  • Figure 5: Experiments on Linear Dynamic System: The effect of the number of pre-training topics ($K$), the number of sequences per topic ($N$) and sequence length ($T$).
  • ...and 2 more figures

Theorems & Definitions (38)

  • Theorem 4.3: Data-Dependent and Optimization-Dependent Generalization Bound of the First-level Expected Loss
  • Remark 4.4
  • Theorem 4.6: Data-Dependent, Topic-Dependent and Optimization-Dependent Generalization Bound of the Two-level Expected Loss
  • Remark 4.7: Optimality Analysis
  • Theorem F.1: Generalization Bound of the First-Level Expected Loss
  • Remark F.2
  • Theorem F.3: Data-Dependent and Optimization-Dependent Generalization Bound of the First-Level Expected Loss
  • Remark F.4
  • Theorem F.5: Data-Dependent and Optimization-Dependent Generalization Bound of the Two-Level Expected Loss
  • Remark F.6
  • ...and 28 more