Fast Gradient Computation for RoPE Attention in Almost Linear Time

Yifang Chen; Jiayan Huo; Xiaoyu Li; Yingyu Liang; Zhenmei Shi; Zhao Song

TL;DR

The paper addresses the cost of backward (gradient) computation for RoPE-based attention. It introduces an algorithm that runs in almost linear time, $n^{1+o(1)}$, under a bounded-entries assumption, combining polynomial approximation of the exponential function with Fast Fourier Transform techniques. It provides a closed-form expression for the gradient, a running-time analysis for each gradient component, and a low-rank framework that approximates the gradient with provable error bounds. A key contribution is tying these algorithmic results to conditional hardness results based on SETH, which show that the bounded-entry condition is necessary for subquadratic performance. The work offers a path toward scalable RoPE-enabled transformers and broadens the understanding of the algorithmic complexity of attention mechanisms in large language models.

Abstract

The Rotary Position Embedding (RoPE) mechanism has become a powerful enhancement to the Transformer architecture, enabling models to capture token relationships while encoding positional information. However, RoPE makes attention computations more complicated, which makes designing efficient algorithms challenging. Earlier research introduced almost linear time algorithms, i.e., algorithms running in $n^{1+o(1)}$ time where $n$ is the number of input tokens, for the forward computation under specific parameter settings. However, achieving a subquadratic time algorithm for other parameter regimes remains impossible unless the widely accepted Strong Exponential Time Hypothesis (SETH) is disproven. In this work, we develop the first almost linear time algorithm for backward computations in RoPE-based attention under bounded entries. Our approach builds on recent advancements in fast RoPE attention computations, utilizing a novel combination of the polynomial method and the Fast Fourier Transform. Furthermore, we show that with lower bounds derived from the SETH, the bounded entry condition is necessary for subquadratic performance.
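To make the abstract's statement that RoPE captures token relationships while encoding position more concrete, the following is a minimal sketch of the commonly used rotary scheme, in which consecutive coordinate pairs are rotated by position-dependent angles. It is an illustration only: the function names `rope_rotate` and `rope_score`, the frequency base `10000.0`, and the sample vectors are assumptions made here, and the paper's formal definition of RoPE attention may be more general than this.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive coordinate pairs of `vec` by position-dependent angles.

    Assumption: the standard rotary scheme with frequencies base**(-2k/d);
    this is an illustrative helper, not the paper's formal definition.
    """
    d = len(vec)
    assert d % 2 == 0, "coordinates are rotated in pairs, so d must be even"
    out = []
    for k in range(d // 2):
        theta = position * base ** (-2.0 * k / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * k], vec[2 * k + 1]
        out.extend([c * x - s * y, s * x + c * y])
    return out

def rope_score(q, k, i, j):
    """Unnormalized attention logit between a query at position i and a key at position j."""
    qi = rope_rotate(q, i)
    kj = rope_rotate(k, j)
    return sum(a * b for a, b in zip(qi, kj))

# Hypothetical sample vectors with d = 4.
q = [0.3, -0.1, 0.7, 0.2]
k = [0.5, 0.4, -0.2, 0.1]

# Both pairs of positions have the same offset i - j = 3.
same_offset_a = rope_score(q, k, 5, 2)
same_offset_b = rope_score(q, k, 105, 102)
# A different offset generally gives a different logit.
other_offset = rope_score(q, k, 5, 4)
```

Because each pair is rotated by an angle proportional to the position, the logit depends on the positions only through the offset `i - j`, so `same_offset_a` and `same_offset_b` agree up to floating-point error. This dependence on the offset is the kind of structure that FFT-based methods can exploit.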

Paper Structure (31 sections, 18 theorems, 46 equations)

This paper contains 31 sections, 18 theorems, 46 equations.

Introduction
Roadmap.
Related Work
Rotary Position Embedding.
Fast Attention Computation.
Gradient Approximation.
Theoretical Foundation of LLMs.
Preliminaries on RoPE Attention
Notation
Problem Definition
Polynomial Approximation of Exponential
Time Complexity of Multiplications
SETH Hypothesis
Basic Facts
Useful Definitions
...and 16 more sections

Key Result

Lemma 3.4

Let $B > 1$ and let $\epsilon \in (0, 0.1)$. There is a polynomial $P : \mathbb{R} \to \mathbb{R}$ of degree $g := \Theta\left( \max \left\{ \frac{\log(1/\epsilon)}{\log( \log(1/\epsilon) / B)} , B \right\}\right)$ such that for all $x \in [0,B]$, we have $|P(x) - \exp(x)| < \epsilon$. The coefficients of $P$ are rational numbers whose numerators and denominators are represented using integers.
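As a rough numerical illustration of approximating the exponential by a low-degree polynomial on $[0, B]$, the sketch below uses a truncated Taylor series. This is an assumption made for simplicity: the Taylor polynomial is not the polynomial constructed in the lemma and need not achieve the stated degree bound. The helper names and the values `B = 2.0` and `eps = 1e-6` are likewise hypothetical. The sketch searches for the smallest Taylor degree whose error on a sample grid falls below `eps`.

```python
import math

def taylor_exp_poly(degree):
    """Coefficients 1/t! of the truncated Taylor series of exp, lowest degree first."""
    return [1.0 / math.factorial(t) for t in range(degree + 1)]

def eval_poly(coeffs, x):
    """Evaluate a polynomial at x with Horner's rule."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

def max_error_on_interval(coeffs, B, samples=1001):
    """Largest |P(x) - exp(x)| over an evenly spaced grid on [0, B], endpoints included."""
    worst = 0.0
    for t in range(samples):
        x = B * t / (samples - 1)
        worst = max(worst, abs(eval_poly(coeffs, x) - math.exp(x)))
    return worst

# Hypothetical parameters chosen only for illustration.
B = 2.0
eps = 1e-6

# Increase the degree until the sampled error drops below eps.
degree = 1
while max_error_on_interval(taylor_exp_poly(degree), B) >= eps:
    degree += 1

final_error = max_error_on_interval(taylor_exp_poly(degree), B)
```

The coefficients `1/t!` are rational numbers with integer numerators and denominators, which matches the flavor of the lemma's closing remark. Note that the grid check is only a sanity test, not a proof of a uniform bound on the whole interval.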

Theorems & Definitions (45)

Definition 3.1: A General Approximate RoPE Attention Computation, $\mathsf{ARAttC}$, Definition 1.1 in as24_rope
Definition 3.2: Optimize RoPE Attention
Definition 3.3: The Approx of the gradient of RoPE Attention Loss Function, $\mathsf{ARAttLGC}(n, d, B, \epsilon)$
Lemma 3.4: aa22
Definition 3.5
proof
Definition 3.12: Softmax $u(x)$
Definition 3.13: Diagonal matrix $\alpha(x)$
Definition 3.14: Normalized softmax $s(x)$
Definition 3.15: Value matrix $v(y)$
...and 35 more
