Theoretical Constraints on the Expressive Power of $\mathsf{RoPE}$-based Tensor Attention Transformers

Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Mingda Wan

TL;DR

The paper analyzes the expressive power of Tensor Attention and $\mathsf{RoPE}$-based Tensor Attention Transformers through circuit complexity. It shows that constant-depth, $\mathrm{poly}(n)$-size models with hidden dimension $d=O(n)$ can be simulated by $\mathsf{DLOGTIME}$-uniform $\mathsf{TC}^0$ circuits, and therefore, unless $\mathsf{TC}^0=\mathsf{NC}^1$, cannot solve fixed membership problems or $(A_{F,r})^*$ closure problems. It provides detailed complexity bounds for matrix operations, tensor operations, $\mathsf{RoPE}$ components, and full multi-layer transformers, establishing $\mathsf{TC}^0$-simulability under certain parameter regimes; the limitations then follow from the $\mathsf{NC}^1$-hardness of the related decision problems. The hardness results connect theory and practice by identifying fundamental expressivity constraints that accompany the empirical success of these architectures. The work offers a principled framework to guide future transformer design toward models with stronger theoretical guarantees, while acknowledging that forward-computation analyses with fixed activations may not capture training dynamics or alternative encodings. Overall, the paper clarifies a notable gap between practical performance and complexity-theoretic limits in high-order, positionally encoded attention mechanisms.

Abstract

Tensor Attention extends traditional attention mechanisms by capturing high-order correlations across multiple modalities, addressing the limitations of classical matrix-based attention. Meanwhile, Rotary Position Embedding ($\mathsf{RoPE}$) has shown superior performance in encoding positional information in long-context scenarios, significantly enhancing transformer models' expressiveness. Despite these empirical successes, the theoretical limitations of these technologies remain underexplored. In this study, we analyze the circuit complexity of Tensor Attention and $\mathsf{RoPE}$-based Tensor Attention, showing that with polynomial precision, constant-depth layers, and linear or sublinear hidden dimension, they cannot solve fixed membership problems or $(A_{F,r})^*$ closure problems, under the assumption that $\mathsf{TC}^0 \neq \mathsf{NC}^1$. These findings highlight a gap between the empirical performance and theoretical constraints of Tensor Attention and $\mathsf{RoPE}$-based Tensor Attention Transformers, offering insights that could guide the development of more theoretically grounded approaches to Transformer model design and scaling.
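
The abstract does not spell out how $\mathsf{RoPE}$ encodes position, so the following is a minimal sketch assuming the standard rotary formulation, in which consecutive coordinate pairs are rotated by position-dependent angles. The function names `rope_rotate` and `dot`, the base `10000.0`, and the sample vectors are illustrative assumptions, and the paper's tensor variant may differ in detail. The sketch shows the property usually credited for RoPE's behaviour: the inner product of a rotated query and key depends only on their relative offset.

```python
import math


def rope_rotate(x, pos, base=10000.0):
    """Rotate consecutive pairs (x[2k], x[2k+1]) by angle pos * base**(-2k/d).

    Assumed standard RoPE form; `base` is an illustrative default.
    """
    d = len(x)
    if d % 2 != 0:
        raise ValueError("dimension must be even")
    out = []
    for k in range(d // 2):
        theta = pos * base ** (-2.0 * k / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * k], x[2 * k + 1]
        # 2x2 rotation applied to the k-th coordinate pair
        out.extend([a * c - b * s, a * s + b * c])
    return out


def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical query/key vectors, chosen only for illustration.
q = [1.0, 2.0, -0.5, 0.3]
k = [0.7, -1.2, 0.4, 2.0]

# Same relative offset (3) at two different absolute positions.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 105), rope_rotate(k, 102))
```

Here `score_a` and `score_b` agree up to floating-point rounding, because each pairwise rotation is orthogonal and composing the rotation for position $i$ with the inverse rotation for position $j$ leaves a rotation by $i-j$ only.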

Paper Structure

This paper contains 29 sections, 24 theorems, 23 equations.

Key Result

Lemma 3.4

If $p$ is an integer with $0 < p \leq \mathop{\mathrm{poly}}\nolimits(n)$, then the conditions below are satisfied:

Theorems & Definitions (71)

  • Definition 3.1: Floating-point number, Definition 9 from chi24
  • Definition 3.2: Rounding, Definition 9 from chi24
  • Definition 3.3: Floating-point operations, Lemma 10 from chi24
  • Lemma 3.4: Floating-point operations in $\mathsf{TC}^0$, Lemma 10 and Lemma 11 from chi24
  • Corollary 3.5: Floor operation in $\mathsf{TC}^0$, Corollary 3.17 from cll+24_rope
  • Lemma 3.6: Computing $\exp$ in $\mathsf{TC}^0$, Lemma 12 from chi24
  • Lemma 3.7: Computing square root in $\mathsf{TC}^0$, Lemma 12 from chi24
  • Definition 3.8: Boolean Circuit, Definition 6.1 from ab09
  • Definition 3.9: Languages, Definition 6.2 from ab09
  • Definition 3.10: $\mathsf{NC}^i$, Definition 6.21 from ab09
  • ...and 61 more