Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models

Runze Liu, Jiakang Wang, Yuling Shi, Zhihui Xie, Chenxin An, Kaiyan Zhang, Jian Zhao, Xiaodong Gu, Lei Lin, Wenping Hu, Xiu Li, Fuzheng Zhang, Guorui Zhou, Kun Gai

TL;DR

AttnRL tackles inefficiencies in process-supervised RL for LLM reasoning by introducing attention-based branching (ATB) to exploit high-impact steps, an adaptive sampling framework (ADS) that focuses on difficult problems and maintains non-zero training signals, and a one-step off-policy training pipeline to reduce redundant sampling. The method hinges on Forward Context Influence (FCI) to identify influential steps through step-level attention, enabling targeted branching and more informative Monte Carlo credits. Empirical results on six challenging mathematical reasoning benchmarks show AttnRL consistently outperforming both OSRL and PSRL baselines, with notable gains in performance and training efficiency (e.g., average +$7.5\%$ over baselines at $1.5$B scale and up to $8\%$ faster training). These contributions advance scalable, efficient reasoning enhancements for LLMs, reducing computational overhead while improving reasoning accuracy.
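
To make the branching idea concrete, here is a minimal sketch of how a forward-influence score over steps could drive the choice of branching positions. It is an illustration under assumptions, not the paper's FCI definition: the step-level attention matrix `step_attn`, the plain sum over later steps, and the helper names `forward_influence` and `top_branch_steps` are all hypothetical.

```python
def forward_influence(step_attn):
    """Score each step by the attention it receives from later steps.

    step_attn[j][k] is assumed to hold the attention that step j pays to
    an earlier step k (k < j). This aggregation is an assumption made
    for illustration only.
    """
    n = len(step_attn)
    scores = []
    for k in range(n):
        # Sum the attention that every later step j > k places on step k.
        scores.append(sum(step_attn[j][k] for j in range(k + 1, n)))
    return scores


def top_branch_steps(step_attn, num_branches):
    """Return the indices of the highest-scoring steps, in step order."""
    scores = forward_influence(step_attn)
    ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return sorted(ranked[:num_branches])


# Toy 4-step example (rows: attending step, columns: attended step).
toy_attn = [
    [0.0, 0.0, 0.0, 0.0],
    [0.9, 0.0, 0.0, 0.0],
    [0.2, 0.7, 0.0, 0.0],
    [0.1, 0.8, 0.1, 0.0],
]
scores = forward_influence(toy_attn)
branch_steps = top_branch_steps(toy_attn, 2)
```

In this toy matrix, steps 0 and 1 receive most of the attention from later steps, so they would be the candidates for branching new rollouts.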

Abstract

Reinforcement Learning (RL) has shown remarkable success in enhancing the reasoning capabilities of Large Language Models (LLMs). Process-Supervised RL (PSRL) has emerged as a more effective paradigm than outcome-based RL. However, existing PSRL approaches suffer from limited exploration efficiency, in terms of both branching positions and sampling. In this paper, we introduce a novel PSRL framework, AttnRL, which enables efficient exploration for reasoning models. Motivated by preliminary observations that steps exhibiting high attention scores correlate with reasoning behaviors, we propose to branch from positions with high attention scores. Furthermore, we develop an adaptive sampling strategy that accounts for problem difficulty and historical batch size, ensuring that the whole training batch maintains non-zero advantage values. To further improve sampling efficiency, we design a one-step off-policy training pipeline for PSRL. Extensive experiments on multiple challenging mathematical reasoning benchmarks demonstrate that our method consistently outperforms prior approaches in performance, sampling efficiency, and training efficiency.
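
The non-zero-advantage requirement can be illustrated with a small sketch. It assumes a group-mean baseline, where each rollout's advantage is its reward minus the mean reward of the rollouts for the same prompt. That baseline and the helper names `group_advantages` and `keep_informative` are assumptions for illustration. The adaptive sampling strategy described above also accounts for problem difficulty and historical batch size, which this sketch does not attempt to reproduce.

```python
def group_advantages(rewards):
    """Advantage of each rollout relative to the group mean (assumed baseline)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def keep_informative(groups):
    """Drop prompts whose rollouts all received the same reward.

    With a group-mean baseline, identical rewards give every rollout an
    advantage of exactly zero, so such prompts contribute no gradient.
    """
    return {p: r for p, r in groups.items() if max(r) != min(r)}


# Hypothetical binary outcome rewards for four rollouts per prompt.
rollout_rewards = {
    "easy": [1, 1, 1, 1],    # always solved: all advantages are zero
    "medium": [1, 0, 1, 0],  # mixed outcomes: non-zero advantages
    "hard": [0, 0, 0, 0],    # never solved: all advantages are zero
}
kept = keep_informative(rollout_rewards)
```

Only the prompt with mixed outcomes survives the filter, which is why a sampler that wants every batch entry to carry signal has to keep drawing or reweighting prompts until enough such groups are available.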

Paper Structure

This paper contains 45 sections, 12 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: An Illustration of AttnRL. (a) AttnRL branches at steps with high attention scores. (b) AttnRL outperforms the baselines with great efficiency.
  • Figure 2: The visualization of steps with high FCI scores.
  • Figure 3: Disruption results on AIME24 and AIME25. (a) Normalized average accuracy of different disruption types. (b) Average accuracy of different disruption positions.
  • Figure 4: Average FCI scores of all problems during the training process of TreeRL on the DeepScaleR dataset.
  • Figure 5: Training pipeline of AttnRL. Our method (left) needs only one generation pass per training iteration, while previous methods (right) need to sample twice and are therefore inefficient.
  • ...and 4 more figures