Efficient Reasoning at Fixed Test-Time Cost via Length-Aware Attention Priors and Gain-Aware Training

Rian Atri

TL;DR

The results suggest that length-aware priors and late-phase gain control preserve scarce improvements, especially in long-span, noisy-logit regimes, while keeping test-time costs effectively unchanged.

Abstract

We study efficient reasoning under tight compute: how can a model make structured, correct decisions without increasing test-time cost? We add two training-only components to small and medium Transformers that also transfer to broader differentiable optimizers. First, a length-aware attention prior built via fuzzy regime position alignment (RPA) yields a normalized pre-softmax bias that guides attention like a structured regularizer while adding no new inference parameters. Second, a minimal gain-aware controller, Guardian, nudges attention sharpness only when validation improvements warrant it, following a two-timescale policy-gradient view of nonconvex optimization; it is disabled at inference. A KL perspective shows that $\mathrm{softmax}(z+\log\pi)$ is the MAP solution under KL regularization, grounding the prior in a principled objective. Under strict compute parity on WikiText-2, we reduce validation cross-entropy while matching baseline latency and memory. At inference, we add a precomputed, cached prior $B(T)$ as a single additive bias per head; the controller does not run. In practice, this incurs negligible overhead (one cached bias add per head) with no measurable p50 latency shift. Our results suggest that length-aware priors and late-phase gain control preserve scarce improvements, especially in long-span, noisy-logit regimes, while keeping test-time costs effectively unchanged.
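The inference-time mechanism described in the abstract, a cached per-length bias added to the attention logits before the softmax, can be illustrated with a minimal sketch. The prior used here (a distance-decay weight whose scale depends on the sequence length T) is a hypothetical placeholder, not the paper's RPA construction, and the names `cached_log_prior` and `attention_with_prior` are illustrative assumptions. The sketch covers a single head and omits causal masking.

```python
import math
from functools import lru_cache


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


@lru_cache(maxsize=None)
def cached_log_prior(T):
    """Hypothetical stand-in for B(T): a row-normalized log-prior for length T.

    ASSUMPTION: the distance-decay form below is a placeholder, not RPA.
    The result is computed once per sequence length and then reused.
    """
    rows = []
    for i in range(T):
        weights = [math.exp(-abs(i - j) / T) for j in range(T)]
        total = sum(weights)
        rows.append(tuple(math.log(w / total) for w in weights))
    return tuple(rows)


def attention_with_prior(logits):
    """Apply softmax(z + log pi) row by row for one head (no causal mask)."""
    T = len(logits)
    bias = cached_log_prior(T)  # cache hit after the first call for this T
    return [
        softmax([z + b for z, b in zip(z_row, b_row)])
        for z_row, b_row in zip(logits, bias)
    ]


logits = [
    [0.2, -1.0, 0.5],
    [1.0, 0.0, -0.5],
    [0.0, 0.0, 0.0],  # all-zero logits: attention falls back to the prior
]
attn = attention_with_prior(logits)
```

Because the bias depends only on T, the only per-step work in this sketch is one elementwise add per head, which matches the cost profile the abstract describes. A row with all-zero logits reproduces the prior exactly, which is one way to sanity-check the normalization.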

Paper Structure (63 sections, 4 theorems, 8 equations, 1 figure, 6 tables, 1 algorithm)

Introduction
Contributions
Scope
Attention with a Prior is KL-Regularized MAP
Implications
Method: Fuzzy Regime Prior for Attention (RPA)
Fuzzy regimes (intuition)
Fuzzy regimes
Length-aware positional basis
Entropic alignment and prior
Motivation
Core code (Gaussian memberships)
Core code (RPA bias)
Gain-Aware Control and Late-Phase Optimization
Design choices
...and 48 more sections

Key Result

Theorem 1

Define $a^\pi(z)=\mathrm{softmax}(z+\log\pi)$. Then $a^\pi(z)=\arg\max_{a\in\Delta}\left\{\langle a,z\rangle-\mathrm{KL}(a\,\|\,\pi)\right\}$, with a unique maximizer.
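A quick numerical check can make Theorem 1 concrete. The sketch below assumes the KL-regularized objective takes the standard form $\langle a,z\rangle-\mathrm{KL}(a\,\|\,\pi)$ over the probability simplex, and that $\pi$ has strictly positive entries. The values of `z` and `pi` are arbitrary illustrative choices, not data from the paper. It compares the objective at $\mathrm{softmax}(z+\log\pi)$ against randomly sampled simplex points and against the closed-form optimum $\log\sum_i \pi_i e^{z_i}$.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def kl(a, pi):
    """KL(a || pi) for discrete distributions; terms with a_i = 0 contribute 0."""
    return sum(ai * math.log(ai / pi_i) for ai, pi_i in zip(a, pi) if ai > 0)


def objective(a, z, pi):
    """Assumed objective: <a, z> - KL(a || pi)."""
    return sum(ai * zi for ai, zi in zip(a, z)) - kl(a, pi)


# Arbitrary illustrative logits and a strictly positive prior.
z = [0.3, -1.2, 2.0, 0.5]
pi = [0.1, 0.2, 0.3, 0.4]

# Candidate maximizer from the theorem.
a_star = softmax([zi + math.log(p) for zi, p in zip(z, pi)])
best = objective(a_star, z, pi)

# Closed-form optimal value: log sum_i pi_i * exp(z_i).
log_partition = math.log(sum(p * math.exp(zi) for zi, p in zip(z, pi)))

rng = random.Random(0)


def random_simplex_point(n):
    """Sample a point on the simplex by normalizing exponential draws."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


# Gap between the optimum and each random candidate; should never be negative.
gaps = [
    best - objective(random_simplex_point(len(z)), z, pi)
    for _ in range(1000)
]
```

Under these assumptions, each gap equals $\mathrm{KL}(a\,\|\,a^\pi(z))$, which is strictly positive for any $a \neq a^\pi(z)$. This is also the reason the maximizer is unique.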

Figures (1)

Figure 1: Training dynamics under fixed compute. (a) Validation cross-entropy and its smoothed rate of change with phase bands (+RPA align, +Guardian, +SWA-select). (b) Guardian's $\tau_{\text{att}}$ adapts cautiously, avoiding over-tightening. (c) $H(\mu)$ rises then stabilizes, indicating non-collapsed, informative regimes.

Theorems & Definitions (7)

Theorem 1: KL-regularized MAP
Theorem 2: Projected two-timescale convergence
Lemma 1: One-step expected improvement
Remark 1: Mapping to our implementation
Proposition 1: Row-sum and practical normalization
Proof: Proof sketch of Theorem 2
Proof: Proof of Lemma 1
