Why Any-Order Autoregressive Models Need Two-Stream Attention: A Structural-Semantic Tradeoff
Patrick Pynadath, Ruqi Zhang
TL;DR
This paper investigates why any-order autoregressive models (AO-ARMs) benefit from two-stream attention and identifies a structural-semantic tradeoff in single-stream causal attention: tokens informative for predicting the next token and tokens that provide a complete history may be misaligned under a random permutation $\pi$. To isolate this tradeoff from position-content separation, the authors propose Decoupled RoPE, which rotates keys by the lagging semantic position and queries by the leading semantic position, providing target position information without revealing content. Experiments on a small text8 setup show that Decoupled RoPE is competitive with masked diffusion at short sequence lengths ($n=128$) but degrades at longer lengths ($n=1024$), consistent with the tradeoff being more severe as semantically adjacent tokens become structurally distant. The results suggest that the empirical success of two-stream attention may stem from its ability to bypass this deeper structural-semantic tension, guiding future efficient AO-ARM designs that balance prediction and global summarization without exposing content.
Abstract
Any-order autoregressive models (AO-ARMs) offer a promising path toward efficient masked diffusion by enabling native key-value caching, but competitive performance has so far required two-stream attention, typically motivated as a means of decoupling token content from position. In this work, we argue that two-stream attention may be serving a more subtle role. We identify a structural-semantic tradeoff in any-order generation: the hidden representation at each step must simultaneously attend to semantically informative tokens for prediction and structurally recent tokens for summarization, objectives that compete for attention capacity in a single stream but can specialize across two streams. To isolate this tradeoff from position-content separation, we propose Decoupled RoPE, a modification to rotary position embeddings that provides target position information without revealing target content. Decoupled RoPE performs competitively at short sequence lengths, where semantic and structural proximity coincide, but degrades as sequence length increases and the two orderings diverge. These results suggest that the success of two-stream attention stems not merely from separating position from content, but from circumventing the deeper structural-semantic tradeoff inherent to any-order generation.
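As an illustration of the Decoupled RoPE idea described above, the following is a minimal NumPy sketch and not the authors' implementation. It assumes one plausible reading of the description: at generation step $t$ under permutation $\pi$, the key is rotated by $\pi(t)$ (the semantic position of the token whose content sits in the stream) and the query by $\pi(t+1)$ (the semantic position of the token to be predicted next). The function names `rope_rotate` and `decoupled_rope_scores`, the half-split rotation layout, the frequency base, and the handling of the final step (which has no target) are all assumptions made for illustration.

```python
import numpy as np


def rope_rotate(x, positions, base=10000.0):
    """Rotate rows of x (shape (n, d), d even) by the given integer positions."""
    n, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)       # per-pair rotation frequencies
    angles = positions[:, None] * freqs[None, :]    # (n, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]               # assumed half-split pairing
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)


def decoupled_rope_scores(q, k, perm):
    """Causal attention logits with keys and queries rotated by different positions.

    perm[t] is the semantic position of the token generated at structural step t.
    """
    perm = np.asarray(perm)
    n = len(perm)
    key_pos = perm                    # lagging: where the already-known content lives
    query_pos = np.roll(perm, -1)     # leading: where the next token will be placed
    query_pos[-1] = perm[-1]          # assumption: last step has no target, reuse own position
    q_rot = rope_rotate(q, query_pos)
    k_rot = rope_rotate(k, key_pos)
    scores = q_rot @ k_rot.T / np.sqrt(q.shape[1])
    causal = np.tril(np.ones((n, n), dtype=bool))   # step t sees structural steps s <= t
    return np.where(causal, scores, -np.inf)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 6, 8
    perm = rng.permutation(n)
    q = rng.normal(size=(n, d))
    k = rng.normal(size=(n, d))
    print(decoupled_rope_scores(q, k, perm).shape)
```

Under these assumptions, the logit between step $t$ and an earlier step $s$ depends on position only through the offset $\pi(t+1) - \pi(s)$, so the query carries the target's position while the keys and values carry only content that has already been generated. The causal mask is over structural (generation) order, which is what makes semantically adjacent tokens potentially far apart in the attended history.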
