In-Context Learning for Attention Scheme: from Single Softmax Regression to Multiple Softmax Regression via a Tensor Trick

Yeqi Gao; Zhao Song; Shenghao Xie

TL;DR

This work provides a rigorous, Lipschitz-based analysis of in-context learning for attention-related regression in large language models. By recasting matrix attention formulations into a vectorized form using the tensor-trick, it derives precise equivalences between matrix and vector regressions and computes gradient/derivative bounds for both normalized and rescaled softmax losses. The paper establishes comprehensive Lipschitz constants for all building blocks (u, alpha, f, h, c, q) and their A-variants, and then translates these into stability results for in-context learning updates in both the Normalized and Rescaled settings. The main contributions are twofold: (i) a thorough operator- and elementwise-bound framework that quantifies how context-induced changes propagate to the loss and gradients, and (ii) explicit in-context learning theorems that bound the regression error under sequential context updates, enabling deeper theoretical understanding of ICL in transformer-like attention schemes.

Abstract

Large language models (LLMs) have brought significant and transformative changes to human society. These models have demonstrated remarkable capabilities in natural language understanding and generation, leading to advancements and impacts across several domains. In this work, we consider in-context learning under two formulations of attention-related regression. Given matrices $A_1 \in \mathbb{R}^{n \times d}$, $A_2 \in \mathbb{R}^{n \times d}$, and $B \in \mathbb{R}^{n \times n}$, the goal is to solve the following optimization problems: the Normalized version $\min_{X} \| D(X)^{-1} \exp(A_1 X A_2^\top) - B \|_F^2$ and the Rescaled version $\min_{X} \| \exp(A_1 X A_2^\top) - D(X) \cdot B \|_F^2$. Here $D(X) := \mathrm{diag}( \exp(A_1 X A_2^\top) {\bf 1}_n )$. Our regression problem shares similarities with previous studies on softmax-related regression. Prior research has extensively investigated two such problems: the Normalized version $\| \langle \exp(Ax) , {\bf 1}_n \rangle^{-1} \exp(Ax) - b \|_2^2$ and the Rescaled version $\| \exp(Ax) - \langle \exp(Ax), {\bf 1}_n \rangle b \|_2^2$. In contrast to previous approaches, we adopt a vectorization technique to address the regression problem in its matrix formulation. This approach expands the dimension from $d$ to $d^2$, so that the problem resembles the softmax regression formulation mentioned above. After completing the Lipschitz analysis of our regression function, we derive our main result concerning in-context learning.
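The vectorization step can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's construction: it assumes row-major vectorization, so that $\mathrm{vec}(A_1 X A_2^\top) = (A_1 \otimes A_2)\,\mathrm{vec}(X)$, and it assumes the resulting $n^2 \times d^2$ matrix is split into $n$ blocks of $n$ rows, one single softmax regression per row of $B$. The paper's Definitions 1.4 and 1.5 may order indices differently. All helper names (`kron`, `vec`, `matrix_losses`, `vector_losses`, `rand_matrix`) are hypothetical.

```python
import math
import random

def matmul(P, Q):
    """Plain-list matrix product P @ Q."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def transpose(M):
    return [list(col) for col in zip(*M)]

def kron(P, Q):
    """Kronecker product: entry [i*n + j][k*d + l] equals P[i][k] * Q[j][l]."""
    return [[p * q for p in P_row for q in Q_row] for P_row in P for Q_row in Q]

def vec(M):
    """Row-major vectorization (an assumption of this sketch)."""
    return [v for row in M for v in row]

def matrix_losses(A1, A2, B, X):
    """Normalized and Rescaled losses computed directly in matrix form."""
    E = [[math.exp(v) for v in row] for row in matmul(matmul(A1, X), transpose(A2))]
    normalized = rescaled = 0.0
    for E_row, B_row in zip(E, B):
        s = sum(E_row)  # the diagonal entry of D(X) for this row
        normalized += sum((e / s - b) ** 2 for e, b in zip(E_row, B_row))
        rescaled += sum((e - s * b) ** 2 for e, b in zip(E_row, B_row))
    return normalized, rescaled

def vector_losses(A1, A2, B, X):
    """The same losses as a sum of n single softmax regressions in x = vec(X)."""
    n = len(A1)
    A = kron(A1, A2)  # n^2 rows, d^2 columns
    x = vec(X)        # length d^2
    normalized = rescaled = 0.0
    for i in range(n):
        block = A[i * n:(i + 1) * n]  # the n x d^2 block paired with row i of B
        u = [math.exp(sum(a * xv for a, xv in zip(row, x))) for row in block]
        s = sum(u)  # <exp(A_i x), 1_n>
        normalized += sum((ui / s - b) ** 2 for ui, b in zip(u, B[i]))
        rescaled += sum((ui - s * b) ** 2 for ui, b in zip(u, B[i]))
    return normalized, rescaled

def rand_matrix(rows, cols):
    """Small random entries keep exp() well within floating-point range."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
n, d = 4, 3
A1, A2, B, X = rand_matrix(n, d), rand_matrix(n, d), rand_matrix(n, n), rand_matrix(d, d)
m_norm, m_resc = matrix_losses(A1, A2, B, X)
v_norm, v_resc = vector_losses(A1, A2, B, X)
print("normalized losses agree:", math.isclose(m_norm, v_norm, rel_tol=1e-9))
print("rescaled losses agree:", math.isclose(m_resc, v_resc, rel_tol=1e-9))
```

The explicit `kron` helper is there only to make the $d \to d^2$ expansion visible; with NumPy available, `numpy.kron` and array reshaping would express the same check more compactly.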

Paper Structure (66 sections, 42 theorems, 158 equations)

This paper contains 66 sections, 42 theorems, 158 equations.

Introduction
Recent softmax regression
The matrix formulation for attention regression
Turning matrix formulation to vector formulation
Our Results
Normalized Version
Rescaled Version
Lipschitz of Gradient
Roadmap.
Related Work
In-Context Learning
Transformer Theory
Preliminary
Notations.
Facts
...and 51 more sections

Key Result

Theorem 2.1

Provided that the subsequent requirements are satisfied: we consider the matrix formulation for attention regression (Definition 1.2).

Theorems & Definitions (99)

Definition 1.1
Definition 1.2: Normalized version
Definition 1.3: Rescaled version
Definition 1.4: Vector (equivalence) version of Definition 1.2
Definition 1.5: Vector (equivalence) version of Definition 1.3
Theorem 2.1: Learning in-context for Normalized Version
Theorem 2.2: Learning in-context for Rescaled Version
Corollary 2.3
Definition 5.1
Definition 5.2
...and 89 more
