ATTNPO: Attention-Guided Process Supervision for Efficient Reasoning

Shuaiyi Nie, Siyu Ding, Wenyuan Zhang, Linhao Yu, Tianmeng Yang, Yao Chen, Tingwen Liu, Weichong Yin, Yu Sun, Hua Wu

TL;DR

AttnPO addresses overthinking in large reasoning models by leveraging internal attention signals to provide fine-grained, stepwise credit assignment. It identifies a small set of Key-Focus Heads (KFHs) that preferentially attend to essential reasoning steps and introduces Stepwise Advantage Rescaling with two strategies—Pos-Adv Attenuation for redundant steps and Neg-Adv Attenuation for essential steps—to reduce unnecessary reasoning while preserving correctness. The approach is designed to be low overhead, requiring no additional reward models or data beyond existing signals, and it achieves substantial reductions in reasoning length with improved accuracy across multiple benchmarks and model scales. Empirical results demonstrate strong efficiency–performance gains and robust generalization, highlighting AttnPO's potential for practical, scalable efficient reasoning.

Abstract

Large reasoning models trained with reinforcement learning and verifiable rewards (RLVR) achieve strong performance on complex reasoning tasks, yet often overthink, generating redundant reasoning without performance gains. Existing trajectory-level length penalties often fail to shorten reasoning effectively and also degrade accuracy, because they treat all reasoning steps uniformly and lack fine-grained signals to distinguish redundancy from necessity. Meanwhile, process-supervised methods are typically resource-intensive and suffer from inaccurate credit assignment. To address these issues, we propose ATTNPO, a low-overhead process-supervised RL framework that leverages the model's intrinsic attention signals for step-level credit assignment. We first identify a set of special attention heads that naturally focus on essential steps while suppressing redundant ones. By leveraging the attention scores of these heads, we then employ two sub-strategies: one mitigates overthinking by discouraging redundant steps, and the other preserves accuracy by reducing penalties on essential steps. Experimental results show that ATTNPO substantially reduces reasoning length while significantly improving performance across 9 benchmarks.
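
The two sub-strategies can be illustrated with a minimal sketch. It is not the paper's formulation: the summary above does not give the rescaling rule, so the hard threshold, the single attenuation factor, and the names `rescale_step_advantages`, `step_scores`, `threshold`, and `attenuation` are all assumptions made for illustration. The sketch only shows the general idea of turning one trajectory-level advantage into per-step advantages using a per-step attention score.

```python
from typing import List


def rescale_step_advantages(
    advantage: float,
    step_scores: List[float],
    threshold: float = 0.5,    # hypothetical cutoff between redundant and essential
    attenuation: float = 0.5,  # hypothetical shrink factor in (0, 1)
) -> List[float]:
    """Spread one trajectory-level advantage over steps (illustrative only).

    step_scores[i] stands in for the attention that the selected heads
    assign to step i; higher is assumed to mean "more essential".
    """
    rescaled = []
    for score in step_scores:
        is_essential = score >= threshold
        if advantage > 0 and not is_essential:
            # Pos-Adv Attenuation: a redundant step in a rewarded
            # trajectory receives less of the positive advantage.
            rescaled.append(advantage * attenuation)
        elif advantage < 0 and is_essential:
            # Neg-Adv Attenuation: an essential step in a penalized
            # trajectory receives less of the negative advantage.
            rescaled.append(advantage * attenuation)
        else:
            # All other steps keep the trajectory-level advantage.
            rescaled.append(advantage)
    return rescaled


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.7]  # made-up scores for three steps
    print(rescale_step_advantages(1.0, scores))
    print(rescale_step_advantages(-1.0, scores))
```

In this sketch a rewarded trajectory pushes less probability mass toward its low-score steps, and a penalized trajectory pushes less away from its high-score steps, which matches the stated intent of discouraging redundancy while protecting essential reasoning.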

Paper Structure

This paper contains 51 sections, 11 equations, 15 figures, 7 tables, and 1 algorithm.

Figures (15)

  • Figure 1: AttnPO vs. other reinforcement learning methods for efficient reasoning.
  • Figure 2: Probing results of Key-Focus Heads.
  • Figure 3: The overall framework of AttnPO.
  • Figure 4: Training dynamics of TLMRE and AttnPO on 1.5B scale on AIME2024 with 16 sampling runs.
  • Figure 5: Ablation of strategies and hyperparameters; purple bars show Tok., red line shows Acc.
  • ...and 10 more figures