Nexus: Higher-Order Attention Mechanisms in Transformers

Hanting Chen, Chong Zhu, Kai Han, Yuchuan Tian, Yuchen Liang, Tianyu Guo, Xinghao Chen, Dacheng Tao, Yunhe Wang

TL;DR

Nexus tackles the expressivity limits of standard self-attention by introducing a recursive higher-order attention mechanism that refines Query and Key representations through inner attentions. A weight-sharing strategy keeps the parameter budget at O(1) with respect to recursive depth, while enabling multi-hop, hierarchical reasoning within a single layer. The authors provide complexity analyses and theoretical arguments showing the approach can overcome the linear bottleneck of traditional attention, complemented by empirical gains across multiple benchmarks and ablations. Additionally, Nexus demonstrates practical value by retrofitting pre-trained LLMs to improve complex reasoning tasks with modest computational overhead, suggesting a viable path for upgrading existing models.

Abstract

Transformers have achieved significant success across various domains, relying on self-attention to capture dependencies. However, the standard first-order attention mechanism is often limited by a low-rank bottleneck, struggling to capture intricate, multi-hop relationships within a single layer. In this paper, we propose Nexus, a novel architecture designed to enhance representational power through a recursive framework. Unlike standard approaches that use static linear projections for Queries and Keys, Nexus dynamically refines these representations via nested self-attention mechanisms. Specifically, the Query and Key vectors are themselves outputs of inner attention loops, allowing tokens to aggregate global context and model high-order correlations prior to the final attention computation. We enforce a parameter-efficient weight-sharing strategy across recursive steps, ensuring that this enhanced expressivity incurs $\mathcal{O}(1)$ additional parameters. We provide theoretical analysis demonstrating that our method breaks the linear bottleneck of standard attention. Empirically, Nexus outperforms standard Transformers on multiple benchmarks.
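To make the idea of Queries and Keys being outputs of inner attention loops concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the single-head setup, the names `init_params`, `refine` and `higher_order_attention`, the choice of letting the current representation serve as its own inner value, and the pair of shared inner projections `inner_Wq` / `inner_Wk` are all assumptions made for illustration. What it is meant to show is only the general shape of the mechanism: `order=1` reduces to standard attention, and higher orders reuse the same weights, so the parameter count does not grow with recursive depth.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilise before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention: MatMul -> SoftMax -> MatMul.
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def init_params(d_model, d_k, seed=0):
    # Hypothetical parameter layout: three outer projections plus one pair of
    # inner projections that every recursive step shares.
    rng = np.random.default_rng(seed)
    shapes = {
        "Wq": (d_model, d_k), "Wk": (d_model, d_k), "Wv": (d_model, d_k),
        "inner_Wq": (d_k, d_k), "inner_Wk": (d_k, d_k),
    }
    return {name: rng.normal(scale=s[0] ** -0.5, size=s) for name, s in shapes.items()}

def refine(h, params, depth):
    # Assumed refinement rule: each step runs an inner self-attention over the
    # current representation h, reusing the same inner weights every time.
    for _ in range(depth):
        h = attention(h @ params["inner_Wq"], h @ params["inner_Wk"], h)
    return h

def higher_order_attention(x, params, order=2):
    # order=1: Q and K are plain linear projections (standard attention).
    # order=m: Q and K each pass through m-1 inner attention steps first.
    q = refine(x @ params["Wq"], params, order - 1)
    k = refine(x @ params["Wk"], params, order - 1)
    return attention(q, k, x @ params["Wv"])

x = np.random.default_rng(1).normal(size=(6, 16))   # 6 tokens, d_model = 16
params = init_params(d_model=16, d_k=8)
out1 = higher_order_attention(x, params, order=1)
out3 = higher_order_attention(x, params, order=3)
n_params = sum(w.size for w in params.values())     # same for every order
```

In this sketch the cost of a higher order shows up as extra attention computations over the sequence rather than extra weights, which is the trade-off the weight-sharing strategy is described as making.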

Paper Structure

This paper contains 27 sections, 2 theorems, 23 equations, 2 figures, 3 tables.

Key Result

Theorem 3.1

(Linear Bottleneck) (1) Given any $N$ different inputs $X_m \in \mathbb{R}^{n \times d}$, $m=1, \dots, N$, and the corresponding target row-stochastic matrices $A_m \in \mathbb{R}^{n \times n}$, as long as $\mathrm{rank}(\log(A_m)) \leq d_k$, there always exist two mappings $Q, K: \mathbb{R}^{n \times d} \rightarrow$ […]

(2) If $d<n-1$, there exist $A_m \in \mathbb{R}^{n \times n}$ that satisfy $\mathrm{rank}(\log(A_m))=1$ but […]
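The theorem is phrased in terms of $\mathrm{rank}(\log(A_m))$. One way to see why that is a natural quantity for softmax attention is the small numerical check below. It is an illustration, not the paper's proof: the sizes `n` and `d_k` and the random `q` and `k` are arbitrary assumptions. For $A = \mathrm{softmax}(QK^\top/\sqrt{d_k})$, the matrix $\log A$ equals the logits minus a per-row constant (the row-wise log-sum-exp), so its rank is at most $d_k + 1$ regardless of the sequence length $n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k = 32, 4                      # sequence length and head dimension (arbitrary)
q = rng.normal(size=(n, d_k))
k = rng.normal(size=(n, d_k))

logits = q @ k.T / np.sqrt(d_k)     # n x n matrix of rank at most d_k

# Stable log-softmax: subtract the row-wise log-sum-exp from the logits.
row_max = logits.max(axis=1, keepdims=True)
log_sum_exp = row_max + np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))
log_a = logits - log_sum_exp        # log of the row-stochastic attention matrix
a = np.exp(log_a)

rank_logits = np.linalg.matrix_rank(logits)
rank_log_a = np.linalg.matrix_rank(log_a)   # at most d_k + 1
print(rank_logits, rank_log_a, n)
```

Subtracting a per-row constant adds at most one rank-one term, which is why the bound is $d_k + 1$ rather than $d_k$. A target attention pattern whose logarithm has much higher rank than this cannot be matched exactly by one first-order head of this size.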

Figures (2)

  • Figure 1: Overview of the Nexus. The figure illustrates the hierarchical structure of our proposed mechanism. Left: The integration of the Nexus layer within a standard Transformer block, replacing the conventional self-attention module. Middle: The detailed architecture of the 2nd-Order Attention mechanism. Unlike standard attention where $Q$ and $K$ are linear projections, Nexus recursively refines $Q$ and $K$ through inner self-attention loops (MatMul $\rightarrow$ SoftMax $\rightarrow$ MatMul) before the final attention computation. This allows the model to capture intricate dependencies prior to the main interaction. Right: The generalized Recursive Attention framework, demonstrating how the mechanism can be extended to arbitrary orders ($m$-th order) to model deeper hierarchical relationships.
  • Figure 2: Visualization of average attention heatmaps. The x-axis represents Key positions, and the y-axis represents Query positions. Brighter colors indicate higher attention weights. (a) Standard Self-Attention in Pythia. (b) The outer main attention of the Nexus network. (c) The inner recursive attention used to project Queries. (d) The inner recursive attention used to project Keys.

Theorems & Definitions (3)

  • Theorem 3.1
  • Theorem 1.1
  • proof