Influence Malleability in Linearized Attention: Dual Implications of Non-Convergent NTK Dynamics

Jose Marie Antonio Miñoza, Paulo Mario P. Medina, Sebastian C. Ibañez

Abstract

Understanding the theoretical foundations of attention mechanisms remains challenging due to their complex, non-linear dynamics. This work reveals a fundamental trade-off in the learning dynamics of linearized attention. Using a linearized attention mechanism with exact correspondence to a data-dependent Gram-induced kernel, both empirical and theoretical analysis through the Neural Tangent Kernel (NTK) framework shows that linearized attention does not converge to its infinite-width NTK limit, even at large widths. A spectral amplification result establishes this formally: the attention transformation cubes the Gram matrix's condition number $\kappa$, requiring width $m = \Omega(\kappa^6)$ for convergence, a threshold that exceeds any practical width for natural image datasets. This non-convergence is characterized through influence malleability, the capacity to dynamically alter reliance on training examples. Attention exhibits 6--9$\times$ higher malleability than ReLU networks, with dual implications: its data-dependent kernel can reduce approximation error by aligning with task structure, but this same sensitivity increases susceptibility to adversarial manipulation of training data. These findings suggest that attention's power and vulnerability share a common origin in its departure from the kernel regime.

Paper Structure (54 sections, 27 theorems, 44 equations, 5 figures, 6 tables)

Key Result

Theorem 4.1

Let $\mathbf{X} \in \mathbb{R}^{n \times d}$ with $\|\mathbf{x}_i\|_2 = 1$ for all $i$. The kernel induced by linearized attention $f^{\text{att}}(\mathbf{X}) = \mathbf{X}\mathbf{X}^T\mathbf{X}$ is: $$K(\mathbf{x}_i, \mathbf{x}_j) = \sum_{k=1}^{n} \sum_{\ell=1}^{n} (\mathbf{x}_i^T \mathbf{x}_k)(\mathbf{x}_k^T \mathbf{x}_\ell)(\mathbf{x}_\ell^T \mathbf{x}_j),$$ where $k$ and $\ell$ are summation indices over all $n$ training samples $\mathbf{x}_1, \ldots, \mathbf{x}_n$ in $\mathbf{X}$. This is a data-dependent kernel induced by the Gram matrix $\mathbf{X}\mathbf{X}^T$ ...
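
The following is a minimal numerical sketch of the statement above, not the authors' code. It assumes the induced kernel is the inner product between rows of $f^{\text{att}}(\mathbf{X}) = \mathbf{X}\mathbf{X}^T\mathbf{X}$, which expands to a double sum over $k$ and $\ell$ and equals the cube of the Gram matrix $\mathbf{X}\mathbf{X}^T$. The data are random unit-norm rows, and the function name `gram_induced_kernel` is hypothetical. The sketch also checks the spectral amplification claim from the abstract: cubing the Gram matrix cubes its condition number.

```python
import numpy as np

def gram_induced_kernel(X):
    """Return (X X^T)^3, the Gram matrix of the rows of f_att(X) = X X^T X."""
    G = X @ X.T
    return G @ G @ G

# Synthetic data (an assumption for illustration): n = 4 samples in d = 6
# dimensions, so the Gram matrix is full rank and its condition number is finite.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # enforce ||x_i||_2 = 1

G = X @ X.T                    # Gram matrix, G[i, k] = x_i^T x_k
F = X @ X.T @ X                # linearized attention output, shape (n, d)

K_features = F @ F.T                           # <f(x_i), f(x_j)>
K_sum = np.einsum("ik,kl,lj->ij", G, G, G)     # explicit double sum over k, l
K_cube = gram_induced_kernel(X)                # (X X^T)^3

def condition_number(A):
    """Ratio of largest to smallest eigenvalue of a symmetric PD matrix."""
    eig = np.linalg.eigvalsh(A)  # eigenvalues in ascending order
    return eig[-1] / eig[0]

kappa_G = condition_number(G)
kappa_K = condition_number(K_cube)

print("all three kernel forms agree:",
      np.allclose(K_features, K_sum) and np.allclose(K_sum, K_cube))
print("kappa(G)^3 =", kappa_G ** 3, " kappa(K) =", kappa_K)
```

Because the eigenvalues of $(\mathbf{X}\mathbf{X}^T)^3$ are the cubes of the eigenvalues of $\mathbf{X}\mathbf{X}^T$, the two printed condition numbers should match up to floating-point error.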

Figures (5)

  • Figure 1: NTK distance $\|f_m - f_{\text{NTK}}\|$ across network widths, where $f_m$ is the finite-width trained model and $f_{\text{NTK}}$ is the infinite-width NTK predictor. 2L-ReLU (blue) shows expected convergence: distance decreases as $m \to \infty$. MLP-Attn (orange) shows fundamentally different behavior: distance fails to decrease monotonically on either dataset (non-monotonic on MNIST, increasing on CIFAR-10), demonstrating that linearized attention never enters the NTK regime.
  • Figure 2: Analysis of influence dynamics. (a) MLP-Attn exhibits consistently higher influence malleability (flip rate) across all perturbation types, reflecting its operation in the feature learning regime. (b) Rank correlation analysis reveals that 2L-ReLU maintains rigid influence rankings while MLP-Attn shows lower correlation, indicating continuous re-evaluation of data dependencies. The "Transformed" intervention (replacing influential examples with adversarial versions) produces the most structured adaptation pattern.
  • Figure 3: Top influential training examples for a representative test digit. Positive influencers (top rows) are examples whose removal increases test loss; negative influencers (bottom rows) decrease test loss when removed. While visual differences between architectures may be subtle, the key distinction is quantitative: MLP-Attn's influence scores are more sensitive to perturbations (28.9% flip rate vs. 3.3% for 2L-ReLU under PGD, Table \ref{tab:malleability}), reflecting the Gram-induced kernel's data-dependent structure.
  • Figure 4: Contribution to model complexity (blue) and average influence score (red) for MLP-Attn across three datasets. The U-shaped complexity curve indicates that the most influential points (both harmful and helpful) contribute most to model complexity, consistent with the findings of \citet{zhang2022rethinking}.
  • Figure 5: Loss landscape comparison (MNIST) at finite width $m = 1024$ (left two panels) and infinite-width NTK limit (right two panels). 2L-ReLU (blue) converges toward its NTK landscape: both finite and infinite-width surfaces share similar geometry. MLP-Attn (orange) shows a qualitatively different finite-width landscape (sharp, deep minimum) compared to its NTK limit (broad, shallow basin), visualizing the NTK non-convergence from a loss geometry perspective. NTK distances: 2L-ReLU $= 39.93$, MLP-Attn $= 33.30$.
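
The two malleability measures named in the Figure 2 caption, flip rate and rank correlation, can be sketched as below. This excerpt does not give the paper's exact definitions, so the sketch makes two assumptions: flip rate is the fraction of training examples whose influence score changes sign after a perturbation, and rank correlation is the Spearman correlation between influence scores before and after. The function names and the influence scores are hypothetical, synthetic values chosen for illustration; they are not results from the paper.

```python
import numpy as np

def flip_rate(before, after):
    """Fraction of examples whose influence sign differs (assumed definition)."""
    before, after = np.asarray(before), np.asarray(after)
    return float(np.mean(np.sign(before) != np.sign(after)))

def spearman(before, after):
    """Spearman rank correlation for scores without ties."""
    def ranks(v):
        # argsort of argsort gives each element's rank (0 = smallest)
        return np.argsort(np.argsort(np.asarray(v)))
    return float(np.corrcoef(ranks(before), ranks(after))[0, 1])

# Synthetic influence scores for five training examples (illustrative only).
before = np.array([0.9, 0.4, -0.2, -0.7, 0.1])
after_rigid = np.array([0.8, 0.5, -0.1, -0.6, 0.2])       # same signs, same order
after_malleable = np.array([-0.3, 0.6, 0.4, -0.5, -0.1])  # three signs change

print("rigid:     flip rate", flip_rate(before, after_rigid),
      "rank corr", spearman(before, after_rigid))
print("malleable: flip rate", flip_rate(before, after_malleable),
      "rank corr", spearman(before, after_malleable))
```

Under these assumed definitions, a model with rigid influence rankings has a flip rate near zero and a rank correlation near one, while a malleable model has a higher flip rate and a lower correlation.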

Theorems & Definitions (57)

  • Definition 3.1: Linearized Attention
  • Remark 3.2: Correspondence to Standard Attention
  • Definition 3.3: Influence Function
  • Definition 3.4: Influence Malleability
  • Theorem 4.1: Data-Dependent Gram-Induced Kernel
  • Theorem 4.2: NTK for Sequential Architecture
  • Theorem 4.3: Influence Function Stability
  • Proposition 4.5: Data-Dependent Kernel Sensitivity
  • Remark 4.6: Connection to Feature Learning
  • Theorem 4.7: Spectral Amplification and NTK Non-Convergence
  • ...and 47 more