Nexus: Higher-Order Attention Mechanisms in Transformers

Hanting Chen; Chong Zhu; Kai Han; Yuchuan Tian; Yuchen Liang; Tianyu Guo; Xinghao Chen; Dacheng Tao; Yunhe Wang

TL;DR

Nexus tackles the expressivity limits of standard self-attention by introducing a recursive higher-order attention mechanism that refines Query and Key representations through inner attentions. A weight-sharing strategy keeps the parameter budget at O(1) with respect to recursive depth, while enabling multi-hop, hierarchical reasoning within a single layer. The authors provide complexity analyses and theoretical arguments showing the approach can overcome the linear bottleneck of traditional attention, complemented by empirical gains across multiple benchmarks and ablations. Additionally, Nexus demonstrates practical value by retrofitting pre-trained LLMs to improve complex reasoning tasks with modest computational overhead, suggesting a viable path for upgrading existing models.

Abstract

Transformers have achieved significant success across various domains, relying on self-attention to capture dependencies. However, the standard first-order attention mechanism is often limited by a low-rank bottleneck and struggles to capture intricate, multi-hop relationships within a single layer. In this paper, we propose Nexus, a novel architecture designed to enhance representational power through a recursive framework. Unlike standard approaches that use static linear projections for Queries and Keys, Nexus dynamically refines these representations via nested self-attention mechanisms. Specifically, the Query and Key vectors are themselves outputs of inner attention loops, allowing tokens to aggregate global context and model high-order correlations prior to the final attention computation. We enforce a parameter-efficient weight-sharing strategy across recursive steps, ensuring that this enhanced expressivity incurs only $\mathcal{O}(1)$ additional parameters with respect to recursive depth. We provide theoretical analysis demonstrating that our method breaks the linear bottleneck of standard attention. Empirically, Nexus outperforms standard Transformers on multiple benchmarks.
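To make the idea of refining Queries and Keys through inner attention concrete, here is a minimal NumPy sketch of one possible reading of the abstract. It is not the authors' implementation. The function names (refine, higher_order_attention, init_params), the parameter names (inner_q, inner_k), the single-head setup, and the exact form of the inner loop are all assumptions made for illustration. The only properties taken from the text are that Queries and Keys pass through inner self-attention before the final attention, and that the inner weights are shared across recursive steps.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)


def attention(q, k, v):
    """Standard scaled dot-product attention over one sequence."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v


def init_params(d_model, rng):
    """Hypothetical parameter set: outer Q/K/V projections plus one shared
    pair of inner projections that is reused at every recursive step."""
    names = ["w_q", "w_k", "w_v", "inner_q", "inner_k"]
    return {
        name: rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
        for name in names
    }


def refine(h, params, depth):
    """Apply `depth` rounds of inner self-attention to h (assumed form).

    The same inner_q / inner_k matrices are used in every round, so the
    parameter count does not depend on depth.
    """
    for _ in range(depth):
        h = attention(h @ params["inner_q"], h @ params["inner_k"], h)
    return h


def higher_order_attention(x, params, depth):
    """Final attention whose Queries and Keys were refined by inner attention."""
    q = refine(x @ params["w_q"], params, depth)
    k = refine(x @ params["w_k"], params, depth)
    v = x @ params["w_v"]
    return attention(q, k, v)


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))  # 5 tokens, model width 8
params = init_params(8, rng)

out_first_order = higher_order_attention(x, params, depth=0)
out_higher_order = higher_order_attention(x, params, depth=2)
n_params = sum(p.size for p in params.values())
```

In this sketch, depth=0 reduces to ordinary first-order attention, and increasing depth adds compute (each round is another attention pass over the sequence) while n_params stays fixed, which is the sense in which weight sharing keeps the parameter overhead constant.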
