Better Process Supervision with Bi-directional Rewarding Signals

Wenxiang Chen, Wei He, Zhiheng Xi, Honglin Guo, Boyang Hong, Jiazheng Zhang, Rui Zheng, Nijun Li, Tao Gui, Yun Li, Qi Zhang, Xuanjing Huang

TL;DR

BiRM introduces a bidirectional process supervision framework for LLM reasoning by combining backward correctness signals with forward-looking success probabilities, inspired by the A* algorithm. It adds a value head to estimate the likelihood of reaching the correct final answer from a partial trajectory, complementing the traditional reward head. Through extensive experiments on GSM8K, MATH-500, and Gaokao2023 across multiple base models, BiRM achieves notable gains over PRMs and ORMs in both Best-of-N sampling and beam-search settings, including a $3.1\%$ Gaokao2023 improvement over PRM under Best-of-N. Analyses reveal BiRM’s improved guidance, robustness to scaling, and orthogonality to existing supervision methods, with insights into value-label annotation strategies and training data scaling. The work highlights bidirectional supervision as a practical, scalable enhancement for LLM-based mathematical reasoning and trajectory search.

Abstract

Process supervision, i.e., evaluating each step, is critical for complex large language model (LLM) reasoning and test-time searching with increased inference compute. Existing approaches, represented by process reward models (PRMs), primarily focus on rewarding signals up to the current step, exhibiting a one-directional nature and lacking a mechanism to model the distance to the final target. To address this problem, we draw inspiration from the A* algorithm, which states that an effective supervisory signal should simultaneously consider the incurred cost and the estimated cost for reaching the target. Building on this key insight, we introduce BiRM, a novel process supervision model that not only evaluates the correctness of previous steps but also models the probability of future success. We conduct extensive experiments on mathematical reasoning tasks and demonstrate that BiRM provides more precise evaluations of LLM reasoning steps, achieving an improvement of 3.1% on Gaokao2023 over PRM under the Best-of-N sampling method. Besides, in search-based strategies, BiRM provides more comprehensive guidance and outperforms ORM by 5.0% and PRM by 3.8% respectively on MATH-500.
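As a rough illustration of the idea in the abstract, the sketch below ranks Best-of-N candidates by a backward term (correctness of the steps so far) plus a forward term (estimated probability of reaching a correct answer), mirroring A*'s incurred cost plus estimated remaining cost. The paper's actual scoring rule is not given in this section, so the mean aggregation, the additive combination, the weight `beta`, and all names (`Candidate`, `bidirectional_score`, `best_of_n`) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One sampled solution with hypothetical model outputs attached."""
    text: str
    step_rewards: List[float]  # assumed reward-head outputs per step, each in [0, 1]
    value: float               # assumed value-head output: estimated success probability


def backward_score(step_rewards: List[float]) -> float:
    """Aggregate per-step rewards; the mean is an assumption, not the paper's rule."""
    if not step_rewards:
        raise ValueError("a candidate needs at least one scored step")
    return sum(step_rewards) / len(step_rewards)


def bidirectional_score(cand: Candidate, beta: float = 1.0) -> float:
    """Backward correctness plus beta-weighted forward success estimate (assumed form)."""
    return backward_score(cand.step_rewards) + beta * cand.value


def best_of_n(candidates: List[Candidate], beta: float = 1.0) -> Candidate:
    """Return the candidate with the highest combined score."""
    return max(candidates, key=lambda c: bidirectional_score(c, beta))


# Toy data: A looks clean step by step but is judged unlikely to finish correctly;
# B has weaker step rewards but a much higher estimated chance of success.
cand_a = Candidate("solution A", step_rewards=[0.9, 0.9], value=0.2)
cand_b = Candidate("solution B", step_rewards=[0.7, 0.7], value=0.8)

reward_only_pick = best_of_n([cand_a, cand_b], beta=0.0)  # backward signal only
combined_pick = best_of_n([cand_a, cand_b], beta=1.0)     # backward + forward signal
```

With `beta=0.0` the ranking falls back to a purely backward, PRM-style score; raising `beta` lets the forward estimate change which candidate is selected, which is the behaviour the bidirectional signal is meant to enable.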

Paper Structure

This paper contains 41 sections, 11 equations, 6 figures, 7 tables, 1 algorithm.

Figures (6)

  • Figure 1: Error-detection accuracy across different steps, where step 1 and steps beyond 15 are truncated for better visualization. We evaluate the process reward model (PRM), value model (VM), and BiRM on PRMBench.
  • Figure 2: An example of our proposed BiRM compared with traditional Process Reward Models (PRMs). Given a question $q$, PRMs only consider the accumulated rewards up to the current step. In contrast, BiRM takes into account two aspects: the correctness rewards received so far and the probability of reaching correct final answers.
  • Figure 3: Scaling decline phenomenon in Best-of-N sampling. We present the BoN accuracy results across five random seeds. For better visualization, we apply the moving average with a window size of $10$.
  • Figure 4: Performance comparison of ORM, PRM, and BiRM under BoN sampling. The base models are the open-source RLHFlow-8B-Deepseek-Data and RLHFlow-8B-Mistral-Data (Xiong et al., 2024). We follow Equation \ref{eq:BiRM-eval} to calculate the BiRM score at test time.
  • Figure 5: The prompt template for MetaMath dataset preprocessing.
  • ...and 1 more figure