Why Any-Order Autoregressive Models Need Two-Stream Attention: A Structural-Semantic Tradeoff

Patrick Pynadath; Ruqi Zhang

TL;DR

This paper investigates why any-order autoregressive models (AO-ARMs) benefit from two-stream attention. It identifies a structural-semantic tradeoff in single-stream causal attention: under a random permutation $\pi$, the tokens most informative for predicting the next token and the tokens that provide a complete history may be misaligned. To isolate this tradeoff from position-content separation, the authors propose Decoupled RoPE, which rotates keys by the lagging semantic position and queries by the leading semantic position, providing target position information without revealing content. Experiments on a small text8 setup show that Decoupled RoPE is competitive with masked diffusion at short sequence lengths ($n=128$) but degrades at longer lengths ($n=1024$), consistent with the tradeoff becoming more severe as semantically adjacent tokens become structurally distant. The results suggest that the empirical success of two-stream attention may stem from its ability to bypass this deeper structural-semantic tension, a finding that could guide future efficient AO-ARM designs that balance prediction and global summarization without exposing content.
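The length dependence can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes that "semantic" distance means distance in the text and "structural" distance means distance in the generation order given by a uniformly random permutation. The function name `mean_structural_gap` is hypothetical. For two distinct steps drawn uniformly from $n$ steps, the expected absolute difference is $(n+1)/3$, so text-adjacent tokens drift further apart in generation order as $n$ grows.

```python
import random

def mean_structural_gap(n, trials=200, seed=0):
    """Average |step(i) - step(i+1)| over text-adjacent positions i, i+1.

    ASSUMPTION: the generation order is a uniformly random permutation.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        order = list(range(n))   # order[t] = text position generated at step t
        rng.shuffle(order)
        step = [0] * n           # step[pos] = generation step of text position pos
        for t, pos in enumerate(order):
            step[pos] = t
        gaps = [abs(step[i] - step[i + 1]) for i in range(n - 1)]
        total += sum(gaps) / len(gaps)
    return total / trials

for n in (128, 1024):
    # The simulated mean should sit close to the closed form (n + 1) / 3.
    print(n, round(mean_structural_gap(n), 1), round((n + 1) / 3, 1))
```

Under these assumptions the gap between a token and its text neighbour grows linearly in $n$, which is consistent with the summary's claim that the tradeoff is more severe at $n=1024$ than at $n=128$.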

Abstract

Any-order autoregressive models (AO-ARMs) offer a promising path toward efficient masked diffusion by enabling native key-value caching, but competitive performance has so far required two-stream attention, typically motivated as a means of decoupling token content from position. In this work, we argue that two-stream attention may be serving a more subtle role. We identify a structural-semantic tradeoff in any-order generation: the hidden representation at each step must simultaneously attend to semantically informative tokens for prediction and structurally recent tokens for summarization, objectives that compete for attention capacity in a single stream but can specialize across two streams. To isolate this tradeoff from position-content separation, we propose Decoupled RoPE, a modification to rotary position embeddings that provides target position information without revealing target content. Decoupled RoPE performs competitively at short sequence lengths, where semantic and structural proximity coincide, but degrades as sequence length increases and the two orderings diverge. These results suggest that the success of two-stream attention stems not merely from separating position from content, but from circumventing the deeper structural-semantic tradeoff inherent to any-order generation.
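To make the key/query decoupling concrete, here is a minimal sketch in plain Python. It is one reading of the description above, not the authors' implementation: it assumes the query at each step is rotated by the position of the token to be predicted (the leading position) and each key by the position of the token it encodes (the lagging position). The names `rope_rotate` and `decoupled_score` are hypothetical, and the exact indexing in the paper may differ.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Standard RoPE: rotate each consecutive pair of `vec` by pos * theta_j."""
    d = len(vec)
    out = []
    for j in range(0, d, 2):
        theta = base ** (-j / d)   # per-pair rotation frequency
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[j], vec[j + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def decoupled_score(q, k, target_pos, source_pos):
    # ASSUMPTION: the query carries the position being predicted (leading),
    # the key carries the position of the token it encodes (lagging).
    return dot(rope_rotate(q, target_pos), rope_rotate(k, source_pos))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# The score depends only on the offset target_pos - source_pos (here 4),
# so the query learns where the target sits relative to each key
# without seeing the target token's content.
a = decoupled_score(q, k, target_pos=7, source_pos=3)
b = decoupled_score(q, k, target_pos=104, source_pos=100)
print(abs(a - b) < 1e-9)
```

The design point of the sketch is that only positions enter the rotation, so target position information reaches the attention scores through the query rotation alone. A single stream still has to serve both prediction and summarization, which is the tradeoff the abstract describes.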

Paper Structure (17 sections, 5 equations, 2 figures)

Introduction
Background
Any-Order Autoregression
Two Stream Attention
The Structural-Semantic Tradeoff
Example.
Dependence on sequence length.
Decoupled RoPE
Rotary Position Embeddings
Decoupling Keys and Queries
What Decoupled RoPE Solves and What It Cannot
Experiments
Setup
Competitive Performance at Short Sequence Lengths
Degradation at Longer Sequence Lengths
...and 2 more sections

Figures (2)

Figure 1: Coherence-diversity frontiers for MDLM and D-RoPE at sequence lengths 128 (left) and 1024 (right). Coherence is measured as the fraction of valid words with four or more characters; diversity is the fraction of unique words. At length 128, the frontiers largely overlap. At length 1024, a clear gap emerges, indicating that D-RoPE's degradation is concentrated in longer words that require global context.
Figure 2: Validation NLL over training for MDLM (left) and D-RoPE (right) at sequence lengths 128 and 1024. For MDLM, the gap between lengths is modest. For D-RoPE, the 1024 curve converges to a substantially worse value, confirming that the length-dependent degradation is specific to the any-order autoregressive parameterization.
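The two metrics in the Figure 1 caption can be sketched as follows. This is one reading of the caption, assuming coherence is the share of words with at least four characters that appear in a reference vocabulary, and diversity is the share of distinct words among all words. The functions `coherence` and `diversity` and the tiny `vocab` set are hypothetical; the paper's dictionary and tokenization are not given here.

```python
def coherence(text, vocab, min_len=4):
    """Fraction of words with >= min_len characters that are in `vocab`."""
    words = [w for w in text.split() if len(w) >= min_len]
    if not words:
        return 0.0
    return sum(w in vocab for w in words) / len(words)

def diversity(text):
    """Fraction of distinct words among all whitespace-separated words."""
    words = text.split()
    return len(set(words)) / len(words) if words else 0.0

# Hypothetical reference vocabulary and sample; "modle" is a deliberate misspelling.
vocab = {"model", "reads", "text", "writes"}
sample = "the model reads text and the modle writes text"

print(coherence(sample, vocab))  # 5 of the 6 long words are valid
print(diversity(sample))         # 7 distinct words out of 9
```

Restricting coherence to longer words matches the caption's emphasis: short words are easy to produce from local context, while longer words are where a gap between the two models would show up.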
