Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods

Zhaiming Shen, Alexander Hsu, Rongjie Lai, Wenjing Liao

TL;DR

The paper studies in-context learning (ICL) for regression on manifolds by linking transformer attention to kernel methods. It shows that attention can implement kernel regression and uses this to derive a generalization bound for transformer-based ICL in terms of the prompt length and the number of training tasks, with an exponential dependence on the intrinsic dimension of the data manifold rather than the ambient dimension. With enough training tasks, transformers achieve near-minimax rates for Hölder functions on manifolds, so the geometry of the data, not the ambient dimensionality, governs generalization. The work provides a geometry-aware framework for analyzing nonlinear ICL models and suggests new tools for studying ICL in structured domains. Numerical experiments complement the theory by demonstrating the kernel-like behavior of attention and validating the proposed bounds.

Abstract

While in-context learning (ICL) has achieved remarkable success in natural language and vision domains, its theoretical understanding, particularly in the context of structured geometric data, remains unexplored. This paper initiates a theoretical study of ICL for regression of Hölder functions on manifolds. We establish a novel connection between the attention mechanism and classical kernel methods, demonstrating that transformers effectively perform kernel-based prediction at a new query through its interaction with the prompt. This connection is validated by numerical experiments, revealing that the learned query-prompt scores for Hölder functions are highly correlated with the Gaussian kernel. Building on this insight, we derive generalization error bounds in terms of the prompt length and the number of training tasks. When a sufficient number of training tasks are observed, transformers give rise to the minimax regression rate of Hölder functions on manifolds, which scales exponentially with the intrinsic dimension of the manifold, rather than the ambient space dimension. Our result also characterizes how the generalization error scales with the number of training tasks, shedding light on the complexity of transformers as in-context kernel algorithm learners. Our findings provide foundational insights into the role of geometry in ICL and novel tools to study ICL of nonlinear models.
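
The link between attention and kernel prediction described above can be illustrated with a small sketch. It is not the authors' construction: it only checks a standard identity, under the assumption that the inputs lie on the unit circle (a one-dimensional manifold in the plane). For unit-norm points, the squared distance equals 2 minus twice the dot product, so softmax over dot-product scores scaled by 1/h^2 gives the same weights as a normalized Gaussian kernel with bandwidth h. The target function, the bandwidth, and all function names are hypothetical choices made for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def nadaraya_watson(query, xs, ys, h):
    # Classical Gaussian kernel regression estimate at the query point.
    k = [math.exp(-sq_dist(query, x) / (2 * h * h)) for x in xs]
    return sum(ki * yi for ki, yi in zip(k, ys)) / sum(k)

def attention_predict(query, xs, ys, h):
    # One softmax attention read-out: the keys are the prompt inputs,
    # the values are the prompt labels, the temperature is h^2 (assumed).
    weights = softmax([dot(query, x) / (h * h) for x in xs])
    return sum(w * y for w, y in zip(weights, ys))

def circle_point(theta):
    return (math.cos(theta), math.sin(theta))

# Hypothetical prompt: 16 points on the unit circle, labels from sin(3 * theta).
thetas = [2 * math.pi * i / 16 for i in range(16)]
xs = [circle_point(t) for t in thetas]
ys = [math.sin(3 * t) for t in thetas]
query = circle_point(0.4)
h = 0.3  # kernel bandwidth, chosen arbitrarily

nw = nadaraya_watson(query, xs, ys, h)
att = attention_predict(query, xs, ys, h)
print(abs(nw - att))  # the two predictions agree up to floating-point error
```

The agreement relies on the unit-norm assumption; for general inputs the squared norms of the prompt points do not cancel inside the softmax and would have to be supplied to the attention scores separately.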

Paper Structure

This paper contains 30 sections, 10 theorems, 145 equations, 5 figures, 1 table.

Key Result

Lemma 1

Let $\mathcal{M}\subset [-b,b]^D$. Suppose the prompt $\mathfrak{s}$ in eq:promp satisfies: the $\mathbf{x}_i$'s are i.i.d. samples from a distribution $\rho_{\mathbf{x}}$ supported on $\mathcal{M}$ and $f:\mathcal{M}\to\mathbb{R}$ is bounded, i.e. $\|f\|_{L^{\infty}(\mathcal{M})}\leq R$. Let […] such that for any sample $\mathfrak{s}$ in the form of eq:promp, we have […] The notation $O(\cdot)$ […]

Figures (5)

  • Figure 1: Examples of attention scores and Gaussian kernel with in-context length $n=8$ (first column), $n=16$ (second column), $n=32$ (third column) respectively. The top and bottom rows are the plots at two different samples. This figure shows a strong correlation between attention scores and Gaussian kernel.
  • Figure 2: Histograms of the Pearson correlation for $n=4,8,16$ respectively. Samples with negative correlation are excluded from the plot; they account for only a small fraction. The counts of positively correlated samples are $4588,4598,4771$ respectively, out of $5000$ samples in each case.
  • Figure 3: Softmax attention scores for real language data.
  • Figure 4: Top row: MSE vs. number of tasks $\Gamma$ (with fixed prompt length $n = 16, 64, 256$, respectively). Bottom row: MSE vs. prompt length $n$ (with fixed number of tasks $\Gamma = 400, 1600, 6400$, respectively). All plots are in log10-log10 scale.
  • Figure 5: More examples of attention scores and Gaussian kernel function with in-context length $n=4, 8,16,32$ respectively.
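
The comparison behind Figures 1, 2 and 5 can be sketched as a Pearson correlation between a vector of attention scores and the Gaussian kernel evaluated at the same prompt points. The sketch below does not use the paper's trained model or data. As a loudly labeled stand-in for learned scores, it uses a softmax over dot products whose temperature deliberately does not match the kernel bandwidth; the points, bandwidth, and temperature are all hypothetical.

```python
import math

def pearson(a, b):
    # Sample Pearson correlation coefficient of two equal-length lists.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def gaussian_kernel(query, xs, h):
    # Unnormalized Gaussian kernel values between the query and each prompt point.
    return [
        math.exp(-sum((q - p) ** 2 for q, p in zip(query, x)) / (2 * h * h))
        for x in xs
    ]

# Hypothetical prompt of length n = 8 on the unit circle.
n = 8
xs = [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)) for i in range(n)]
query = (math.cos(0.4), math.sin(0.4))

kernel_vals = gaussian_kernel(query, xs, h=0.5)

# Stand-in for learned attention scores (assumption, not model output):
# softmax of scaled dot products with a mismatched temperature.
stand_in_scores = softmax([2.0 * sum(q * p for q, p in zip(query, x)) for x in xs])

r = pearson(stand_in_scores, kernel_vals)
print(r)  # strongly positive for this synthetic example
```

Repeating this computation over many sampled prompts and collecting the values of `r` would give a histogram of the kind Figure 2 describes.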

Theorems & Definitions (28)

  • Definition 1: Hölder function on a manifold
  • Definition 2: Attention and Multi-head Attention
  • Definition 3: Transformer Network Class
  • Lemma 1
  • Remark 1: Universality
  • Theorem 1
  • Definition 4: Medial Axis
  • Definition 5: Local Reach and Reach of a Manifold
  • Definition 6: Covering Number
  • Definition 7: Embedding Layer
  • ...and 18 more