Circuit Complexity Bounds for RoPE-based Transformer Architecture

Bo Chen, Xiaoyu Li, Yingyu Liang, Jiangxuan Long, Zhenmei Shi, Zhao Song

TL;DR

This work establishes a circuit complexity bound for RoPE-based Transformer architectures. It proves that such models can be simulated by uniform $\mathsf{TC}^0$ circuits under polynomial precision, constant depth, and hidden dimension $d \le O(n)$, and shows that, unless $\mathsf{TC}^0 = \mathsf{NC}^1$, they cannot solve fundamental problems such as Arithmetic Formula Evaluation or Boolean Formula Value. The analysis decomposes the RoPE-based Transformer into tractable components (trigonometric functions, matrix products, attention, and MLP/LN blocks), places each within $\mathsf{TC}^0$, and then aggregates them into the full multi-layer Transformer. The key contributions are (i) a detailed per-component $\mathsf{TC}^0$ computation bound for RoPE-based Transformers, (ii) the main result that RoPE-based Transformers are simulable in $\mathsf{TC}^0$, and (iii) hardness theorems showing expressivity limits that hold unless $\mathsf{TC}^0 = \mathsf{NC}^1$. These results expose fundamental theoretical constraints on RoPE-based architectures and offer guidance for future theoretical and practical study of their capabilities and limitations.
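To make the decomposition concrete, here is a minimal Python sketch of a RoPE attention score. The paper's own definitions are not reproduced in this summary, so the sketch assumes the commonly used RoPE formulation (rotating consecutive coordinate pairs by the angle `pos * base ** (-2 * i / d)`); the names `rope_rotate`, `rope_score`, and the default `base=10000.0` are illustrative assumptions, not taken from the paper.

```python
import math

def rope_rotate(x, pos, base=10000.0):
    """Rotate consecutive coordinate pairs of x by position-dependent angles.

    Assumed formulation: pair i is rotated by pos * base ** (-2 * i / d).
    """
    d = len(x)
    if d % 2 != 0:
        raise ValueError("dimension must be even")
    out = [0.0] * d
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)  # trigonometric component
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s                # 2x2 rotation (matrix product)
        out[2 * i + 1] = a * s + b * c
    return out

def rope_score(q, k, m, n):
    """Unnormalized attention score between query at position m and key at position n."""
    qm = rope_rotate(q, m)
    kn = rope_rotate(k, n)
    return sum(a * b for a, b in zip(qm, kn))

# Hypothetical query/key vectors, chosen only for illustration.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

s1 = rope_score(q, k, 5, 2)    # positions differ by 3
s2 = rope_score(q, k, 13, 10)  # same offset, shifted positions
```

Under this assumed formulation, `s1` and `s2` agree up to floating-point error, because the score depends only on the relative offset between the two positions. The sketch uses only sine, cosine, multiplication, and summation, which correspond to the trigonometric and matrix-product components named in the decomposition above.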

Abstract

Characterizing the expressive power of the Transformer architecture is critical to understanding its capacity limits and scaling laws. Recent works provide circuit complexity bounds for Transformer-like architectures. Meanwhile, Rotary Position Embedding ($\mathsf{RoPE}$) has emerged as a crucial technique in modern large language models: it captures positional information better than traditional position embeddings and shows great promise in applications, particularly in long-context scenarios. Empirical evidence also suggests that $\mathsf{RoPE}$-based Transformer architectures generalize better than conventional Transformer models. In this work, we establish a circuit complexity bound for Transformers with $\mathsf{RoPE}$ attention. Our key contribution is to show that, unless $\mathsf{TC}^0 = \mathsf{NC}^1$, a $\mathsf{RoPE}$-based Transformer with $\mathrm{poly}(n)$-precision, $O(1)$ layers, and hidden dimension $d \leq O(n)$ cannot solve the Arithmetic formula evaluation problem or the Boolean formula value problem. This result demonstrates a fundamental limitation on the expressivity of the $\mathsf{RoPE}$-based Transformer architecture, despite its great empirical success. Our theoretical result not only establishes the complexity bound but may also guide further work on the $\mathsf{RoPE}$-based Transformer.
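For readers unfamiliar with the Boolean formula value problem, the following Python sketch shows what an instance looks like and how it can be evaluated sequentially. The paper's formal definition is not included in this summary, so the string encoding used here (constants `0` and `1`, prefix `!`, and fully parenthesized binary `&` and `|`) and the function name `eval_formula` are assumptions made purely for illustration.

```python
def eval_formula(s):
    """Evaluate a Boolean formula over the constants 0 and 1.

    Assumed grammar: F := '0' | '1' | '!' F | '(' F '&' F ')' | '(' F '|' F ')'
    """
    pos = 0

    def parse():
        nonlocal pos
        if pos >= len(s):
            raise ValueError("unexpected end of input")
        ch = s[pos]
        if ch in "01":
            pos += 1
            return ch == "1"
        if ch == "!":
            pos += 1
            return not parse()
        if ch == "(":
            pos += 1                      # consume "("
            left = parse()
            op = s[pos]
            pos += 1                      # consume the operator
            right = parse()
            if pos >= len(s) or s[pos] != ")":
                raise ValueError("expected ')'")
            pos += 1                      # consume ")"
            if op == "&":
                return left and right
            if op == "|":
                return left or right
            raise ValueError(f"unknown operator {op!r}")
        raise ValueError(f"unexpected character {ch!r}")

    value = parse()
    if pos != len(s):
        raise ValueError("trailing input")
    return value

example = eval_formula("((1&0)|!(0&1))")
```

This evaluator recurses once per nesting level, so its sequential depth grows with the formula. The result stated in the abstract is that a $\mathsf{RoPE}$-based Transformer with $\mathrm{poly}(n)$-precision, $O(1)$ layers, and $d \leq O(n)$ cannot solve this problem unless $\mathsf{TC}^0 = \mathsf{NC}^1$.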

Paper Structure

This paper contains 26 sections, 22 theorems, 21 equations.

Key Result

Lemma 3.16

Let $p$ be a positive integer. If $p \leq \mathrm{poly}(n)$, then the following statements hold:

Theorems & Definitions (60)

  • Definition 3.1: Boolean circuit, Definition 6.1 on page 102 of ab09
  • Definition 3.2: Languages recognized by a circuit family, Definition 6.2 on page 103 of ab09
  • Definition 3.3: $\mathsf{NC}^i$, Definition 6.21 on page 109 of ab09
  • Definition 3.4: $\mathsf{AC}^i$, Definition 6.22 on page 109 of ab09
  • Definition 3.5: $\mathsf{TC}^i$, Definition 4.34 on page 126 of vol99
  • Remark 3.6
  • Definition 3.7: $\mathsf{P}$, Definition 1.20 on page 9 of ab09
  • Remark 3.9
  • Definition 3.10: $\mathsf{L}$-uniformity, Definition 6.5 on page 104 of ab09
  • Definition 3.11: $\mathsf{DLOGTIME}$-uniformity, Definition 4.28 on page 123 of bi94
  • ...and 50 more