Summarize Before You Speak with ARACH: A Training-Free Inference-Time Plug-In for Enhancing LLMs via Global Attention Reallocation

Jingtao Wang, Yucong Wang, Jun Ding, Rui Cai, Xun Wang

TL;DR

This work proposes ARACH (Attention Reallocation via an Adaptive Context Hub), a training-free inference-time plug-in that augments LLMs with an adaptive context hub to aggregate context and reallocate attention.

Abstract

Large language models (LLMs) achieve remarkable performance, yet further gains often require costly training. This has motivated growing interest in post-training techniques, especially training-free approaches that improve models at inference time without updating weights. Most training-free methods treat the model as a black box and improve outputs via input/output-level interventions, such as prompt design and test-time scaling through repeated sampling, reranking/verification, or search. They rarely offer a plug-and-play mechanism to intervene in a model's internal computation. We propose ARACH (Attention Reallocation via an Adaptive Context Hub), a training-free inference-time plug-in that augments LLMs with an adaptive context hub to aggregate context and reallocate attention. Extensive experiments across multiple language modeling tasks show consistent improvements with modest inference overhead and no parameter updates. Attention analyses further suggest that ARACH mitigates the attention sink phenomenon. These results indicate that engineering a model's internal computation offers a distinct inference-time strategy, fundamentally different from both prompt-based test-time methods and training-based post-training approaches.

Paper Structure

This paper contains 36 sections, 9 equations, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: ARACH overview. (a) Summarize-then-generate intuition. At each decoding step, an index-aligned context hub stream $\{c_i\}$ aggregates the causally available prefix and supports next-token prediction alongside the verbal token stream $\{x_i\}$. (b) Hub-based attention routing realized by a two-stream attention layout. Self-attention is partitioned into four blocks under hub-specific visibility constraints. A scalar logit offset $b$ is applied to regulate the strength of hub-mediated attention reallocation at inference time.
  • Figure 2: Attention sink and hub-mediated reallocation analyses on PG-19, comparing the baseline and ARACH (default logit offset setting). (a) Layerwise sink score, defined as the mean attention mass assigned to the first verbal token, averaged over heads and samples in the test set. $L^*$ denotes the layer with the largest sink score under the baseline. (b,c) Heatmaps of mean attention weights among verbal tokens at layer $L^*$ for the baseline and ARACH. The red boxes highlight early verbal-token columns that are most associated with sink-like concentration in the baseline. For readability, only the first and last $K=64$ tokens are shown. (d) Layerwise attention-mass decomposition, reporting the fraction of attention assigned to the first verbal token, to hub tokens, and to the remaining verbal tokens (averaged over samples in the test set and heads). (e) Routing summary at layer $L^*$ for ARACH, reporting the fraction of attention mass allocated to each attention block.
  • Figure 3: Sensitivity to the hub-attention logit offset $b$ across evaluations (sweep from $0$ to $-1.0$). Curves show the task metric versus $b$; the dashed horizontal line is the baseline and the red dashed vertical line marks the default $b=-0.5$. The blue area indicates values of $b$ for which all tasks improve over the baseline in our sweep. (a) LAMBADA (Accuracy$\uparrow$). (b) StoryCloze (Accuracy$\uparrow$). (c) SQuAD (F1$\uparrow$). (d) WikiText-103 (Perplexity$\downarrow$). (e) PG-19 (Perplexity$\downarrow$). (f) SQuAD (Exact Match$\uparrow$).
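The quantities named in the captions above (the sink score, the attention-mass decomposition, and the scalar logit offset $b$) can be illustrated with a small toy computation. The following is a minimal sketch, not the authors' implementation: the function names, the toy key layout, and the choice to add $b$ to hub-key logits before the softmax are all assumptions made for illustration, since the captions do not specify where the offset is applied.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of attention logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attention_rows(logits, hub_cols, b=0.0):
    """Add the scalar offset b to hub-key logits, then normalize each query row.

    Assumption: b is added to the logits of hub keys before the softmax.
    """
    return [
        softmax([v + b if j in hub_cols else v for j, v in enumerate(row)])
        for row in logits
    ]

def mass_decomposition(attn, hub_cols, first_verbal=0):
    """Mean attention mass on (first verbal token, hub tokens, other verbal tokens).

    `attn` is a list of query rows, assumed already flattened over heads and
    samples; the first returned value plays the role of the sink score.
    """
    n = len(attn)
    sink = sum(row[first_verbal] for row in attn) / n
    hub = sum(row[j] for row in attn for j in hub_cols) / n
    return sink, hub, 1.0 - sink - hub

# Toy layout (hypothetical): keys 0-1 are verbal tokens, keys 2-3 are hub tokens.
toy_logits = [[2.0, 0.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
toy_hub_cols = {2, 3}
for b in (0.0, -0.5):
    attn = attention_rows(toy_logits, toy_hub_cols, b)
    sink, hub, rest = mass_decomposition(attn, toy_hub_cols)
    print(f"b={b:+.1f}  sink={sink:.3f}  hub={hub:.3f}  rest={rest:.3f}")
```

In this toy setting, a more negative $b$ lowers the share of attention on hub keys and returns it to verbal keys, which is the sense in which a scalar offset can regulate routing strength. The example only illustrates the bookkeeping behind the reported fractions; it does not reproduce the paper's finding on sink mitigation, which depends on the full two-stream layout and a real model.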