Gradient Boosting within a Single Attention Layer

Saleh Sargolzaei

Abstract

Transformer attention computes a single softmax-weighted average over values -- a one-pass estimate that cannot correct its own errors. We introduce \emph{gradient-boosted attention}, which applies the principle of gradient boosting \emph{within} a single attention layer: a second attention pass, with its own learned projections, attends to the prediction error of the first and applies a gated correction. Under a squared reconstruction objective, the construction maps onto Friedman's gradient boosting machine, with each attention pass as a base learner and the per-dimension gate as the shrinkage parameter. We show that a single Hopfield-style update erases all query information orthogonal to the stored-pattern subspace, and that further iteration under local contraction can collapse distinct queries in the same region to the same fixed point. We also show that separate projections for the correction pass can recover residual information inaccessible to the shared-projection approach of Tukey's twicing. On a 10M-token subset of WikiText-103, gradient-boosted attention achieves a test perplexity of $67.9$ compared to $72.2$ for standard attention, $69.6$ for Twicing Attention, and $69.0$ for a parameter-matched wider baseline, with two rounds capturing most of the benefit.
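
A minimal sketch of the mechanism in PyTorch, for concreteness. This is a single-head rendering of gradient-boosted attention with $M{=}2$ rounds under stated assumptions: the correction pass reads its queries, keys, and values from the residual $\mathbf{r} = \mathbf{x} - \hat{\mathbf{y}}_0$, and the gate is a per-dimension sigmoid. All names (GradientBoostedAttention, q0, gate, and so on) are illustrative; the paper's implementation may differ in heads, masking, and normalization.

    # Sketch only: single-head, no causal mask, hypothetical parameter names.
    import torch
    import torch.nn.functional as F

    class GradientBoostedAttention(torch.nn.Module):
        def __init__(self, d_model: int):
            super().__init__()
            # Round 0: standard attention projections.
            self.q0 = torch.nn.Linear(d_model, d_model, bias=False)
            self.k0 = torch.nn.Linear(d_model, d_model, bias=False)
            self.v0 = torch.nn.Linear(d_model, d_model, bias=False)
            # Round 1: separate projections for the correction pass.
            self.q1 = torch.nn.Linear(d_model, d_model, bias=False)
            self.k1 = torch.nn.Linear(d_model, d_model, bias=False)
            self.v1 = torch.nn.Linear(d_model, d_model, bias=False)
            # Per-dimension gate, the analogue of boosting's shrinkage.
            self.gate = torch.nn.Parameter(torch.zeros(d_model))
            self.scale = d_model ** -0.5

        def forward(self, x):  # x: (batch, seq, d_model)
            # Pass 0: one softmax-weighted average, the initial estimate.
            a0 = F.softmax((self.q0(x) @ self.k0(x).transpose(-2, -1)) * self.scale, dim=-1)
            y0 = a0 @ self.v0(x)
            # Prediction error of the first pass.
            r = x - y0
            # Pass 1: attend to the residual with its own projections.
            a1 = F.softmax((self.q1(r) @ self.k1(r).transpose(-2, -1)) * self.scale, dim=-1)
            y1 = a1 @ self.v1(r)
            # Gated correction; sigmoid keeps each coordinate in (0, 1).
            return y0 + torch.sigmoid(self.gate) * y1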


Paper Structure

This paper contains 47 sections, 3 theorems, 6 equations, 5 figures, 5 tables, and 1 algorithm.

Key Result

Proposition 1

Let $X \in \mathbb{R}^{d \times N}$ be a matrix of stored patterns and $T(\boldsymbol{\xi}) = X\,\mathrm{softmax}(\beta X^\top \boldsymbol{\xi})$ the Hopfield update. Then $T(\boldsymbol{\xi})$ lies in the column space of $X$ and depends on $\boldsymbol{\xi}$ only through its orthogonal projection onto $\mathrm{col}(X)$; a single update therefore erases every component of the query orthogonal to the stored-pattern subspace.
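
A quick numerical check of the claim, assuming nothing beyond the update formula above (the projector $P$ and all variable names are my construction): the output of one update stays in $\mathrm{col}(X)$, and the orthogonal component of the query has no effect.

    # Numerical check of Proposition 1: T(xi) = X softmax(beta X^T xi)
    # lies in col(X) and ignores the component of xi orthogonal to col(X).
    import numpy as np

    rng = np.random.default_rng(0)
    d, N, beta = 16, 5, 2.0
    X = rng.standard_normal((d, N))       # stored patterns as columns
    xi = rng.standard_normal(d)           # query

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    T = lambda q: X @ softmax(beta * (X.T @ q))

    P = X @ np.linalg.pinv(X)             # orthogonal projector onto col(X)
    print(np.allclose(T(xi), T(P @ xi)))  # True: orthogonal part is erased
    print(np.allclose(P @ T(xi), T(xi)))  # True: output lies in col(X)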

Figures (5)

  • Figure 1: (a) Standard attention computes a single softmax-weighted average. (b) Gradient-boosted attention ($M{=}2$) adds a second pass that attends to the prediction error $\mathbf{r} = \mathbf{x} - \hat{\mathbf{y}}_0$ with separate projections $W_Q^{(1)}, W_K^{(1)}, W_V^{(1)}$. A learned gate $\mathbf{g}$ controls the per-dimension correction magnitude.
  • Figure 2: Left: WikiText-103 test perplexity (zoomed axis). Gradient-boosted attention outperforms all baselines including Twicing and a parameter-matched wider model. Right: Retrieval accuracy on the synthetic denoising task as a function of boosting rounds. The jump from 1 to 2 rounds captures most of the improvement.
  • Figure 3: Learned gate values per dimension for each transformer layer, averaged over 50 test sequences. The dashed line marks $g = 0.5$. Gate magnitudes and variation differ across layers, with layer 1 applying the strongest and most selective correction.
  • Figure 4: Left: Distribution of attention entropy across all layers and heads for round 0 (initial) and round 1 (correction). The correction round has 22% lower entropy on average. Right: Mean entropy per layer. Layers 1--2 show the sharpest correction attention, coinciding with higher gate values (Figure 3); a sketch of the entropy computation follows this list.
  • Figure 5: Three tokens where gradient-boosted attention corrects a prediction error. Blue bars show round 0 attention (initial, diffuse); red bars show round 1 attention (correction, concentrated on relevant context). Each title shows the target token, the standard model's prediction, and the boosted model's prediction with cross-entropy loss. Layer 1, head-averaged.
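
As referenced in the Figure 4 caption, a sketch of the entropy diagnostic (my construction, not the paper's code): the Shannon entropy of each softmax attention row, which is lower when attention is more concentrated.

    # Entropy of attention rows: attn has shape (..., queries, keys),
    # with each row a probability distribution over keys.
    import torch

    def attention_entropy(attn: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
        h = -(attn * (attn + eps).log()).sum(dim=-1)  # nats, per query row
        return h.mean()  # average over batches, heads, and queries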

Theorems & Definitions (8)

  • Proposition 1: One-step projection and information loss
  • Proof
  • Remark 1
  • Proposition 2: MART equivalence
  • Proof
  • Remark 2
  • Proposition 3: Limitation of shared attention in Twicing
  • Proof