DySCO: Dynamic Attention-Scaling Decoding for Long-Context LMs

Xi Ye; Wuwei Zhang; Fangcong Yin; Howard Yen; Danqi Chen

TL;DR

DySCO leverages retrieval heads--a subset of attention heads specialized for long-context retrieval--to identify task-relevant tokens at each decoding step and explicitly up-weight them, dynamically adjusting attention during generation to better utilize relevant context.

Abstract

Understanding and reasoning over long contexts is a crucial capability for language models (LMs). Although recent models support increasingly long context windows, their accuracy often deteriorates as input length grows. In practice, models often struggle to keep attention aligned with the most relevant context throughout decoding. In this work, we propose DySCO, a novel decoding algorithm for improving long-context reasoning. DySCO leverages retrieval heads--a subset of attention heads specialized for long-context retrieval--to identify task-relevant tokens at each decoding step and explicitly up-weight them. By doing so, DySCO dynamically adjusts attention during generation to better utilize relevant context. The method is training-free and can be applied directly to any off-the-shelf LM. Across multiple instruction-tuned and reasoning models, DySCO consistently improves performance on challenging long-context reasoning benchmarks, yielding relative gains of up to 25% on MRCR and LongBenchV2 at 128K context length with modest additional compute. Further analysis highlights the importance of both dynamic attention rescaling and retrieval-head-guided selection for the effectiveness of the method, while providing interpretability insights into decoding-time attention behavior. Our code is available at https://github.com/princeton-pli/DySCO.

Paper Structure (57 sections, 7 equations, 6 figures, 4 tables, 1 algorithm)

Introduction
Background and Motivation
Preliminaries: Retrieval Heads
Retrieval Heads.
Query-Focused Retrieval Heads (QRHead).
Retrieval Heads Stay Focused on Relevant Context
Diagnostic task: Path Traversal.
Severe performance degradation on Path Traversal.
Behavior of retrieval heads.
Steering overall attention with retrieval heads.
DySCO: Dynamic Attention Scaling
Overview
The DySCO Algorithm
Aggregation.
Selection.
...and 42 more sections

Figures (6)

Figure 1: Top: An illustrative Path Traversal task (simplified). Solving the task requires dynamically locating relevant context during decoding. Bottom: Accuracy as a function of context length for models with and without DySCO. Despite the total context being only 16K tokens, both models exhibit severe performance degradation as context length increases.
Figure 2: Overview of the DySCO algorithm. At each decoding step, DySCO consists of three stages: (1) Aggregation: We run a partial forward pass over the input sequence to obtain the attention of retrieval heads, such as QRHead, and use it to assign relevance scores to context tokens; (2) Selection: We use the relevance scores to select the important tokens; (3) Rescaling: We up-weight the important tokens by intervening on the attention logits of all attention heads and run a full forward pass to sample the next token.
Figure 3: Left: Performance of Qwen3-8B on Path Traversal as context length increases. Middle: Fraction of cases in which the gold edge appears among the top-5% of edges ranked by attention score (sum of attention over all tokens in the span) from QRHead versus random heads. Right: Attention mass assigned to gold edges by QRHead and random heads. Despite severe performance degradation and a reduction in attention mass on gold edges, QRHead consistently allocates substantially higher attention to the gold edges.
Figure 4: Performance on MRCR, LongBenchV2, and Clipper. DySCO substantially outperforms vanilla decoding and UniAttnS. YaRN is applied to Qwen models at 128K context length, but not to Llama-3.1-8B-Instruct, which natively supports 128K.
Figure 5: Comparison between DySCO, RAG (Stella), LongLLMLingua, and vanilla decoding. For RAG and LongLLMLingua, we report the results after reducing the context to different lengths (4K, 8K, and 16K tokens).
...and 1 more figures
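
The three stages named in the Figure 2 caption (aggregation, selection, rescaling) can be illustrated with a small NumPy sketch for a single decoding step. This is not the authors' implementation: the function names, mean aggregation over retrieval heads, top-k selection, and the additive log(boost) bias on attention logits are all assumptions chosen for illustration; the released code at the repository linked in the abstract is the reference for the actual rules.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def aggregate(retrieval_attn):
    """Aggregation (assumed rule: mean over retrieval heads).

    retrieval_attn: (num_retrieval_heads, context_len) attention weights
    from the current decoding position to each context token.
    Returns one relevance score per context token.
    """
    return retrieval_attn.mean(axis=0)


def select(relevance, top_k):
    """Selection (assumed rule: keep the top_k highest-scoring tokens)."""
    return np.argsort(relevance)[-top_k:]


def rescale(attn_logits, selected, boost):
    """Rescaling (assumed rule: add log(boost) to the selected logits).

    attn_logits: (num_heads, context_len) pre-softmax attention logits
    of all heads. Adding log(boost) multiplies the unnormalized weight
    of each selected token by boost before renormalization.
    """
    out = attn_logits.copy()
    out[:, selected] += np.log(boost)
    return softmax(out, axis=-1)


# Toy step: 2 retrieval heads, 2 ordinary heads, 4 context tokens.
retrieval_attn = np.array([[0.1, 0.6, 0.2, 0.1],
                           [0.2, 0.5, 0.2, 0.1]])
attn_logits = np.zeros((2, 4))  # uniform attention before intervention

relevance = aggregate(retrieval_attn)
selected = select(relevance, top_k=1)
new_attn = rescale(attn_logits, selected, boost=3.0)
```

In this toy step the retrieval heads agree that token 1 is most relevant, so it is the only token selected, and its share of every head's attention rises from 1/4 to 1/2 while the remaining tokens drop to 1/6 each. The additive-bias form was chosen because it keeps each head's attention a valid distribution after the intervention.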
