The Bidirectional Process Reward Model
Lingyin Zhang, Jun Gao, Xiaoxue Ren, Ziqiang Cao
TL;DR
This paper addresses the limited global context of traditional process reward models by introducing BiPRM, a bidirectional evaluation framework that runs a parallel backward (R2L) stream alongside the conventional forward (L2R) stream and fuses their rewards with a dynamic gate. By reversing the reasoning trajectory through prompt reversal and adaptively weighting the forward and backward signals, BiPRM improves step- and trajectory-level verification at the cost of only a 0.3% parameter increase for the gating module and roughly 5% additional inference latency. Across the GSM-Plus, MATH500, and ProcessBench benchmarks, and across multiple backbones and PRM objectives, BiPRM yields consistent gains in solution ranking (10.6% average relative gain) and step-level error detection (37.7% average relative gain). The work highlights the importance of bidirectional context for process-based supervision and suggests a practical, scalable direction for improving reasoning in large language models.
Abstract
Process Reward Models (PRMs), which assign fine-grained scores to intermediate reasoning steps within a solution trajectory, have emerged as a promising approach to enhancing the reasoning quality of Large Language Models (LLMs). However, most existing PRMs rely on a unidirectional left-to-right (L2R) evaluation scheme, which restricts their use of global context. To address this limitation, we propose a novel bidirectional evaluation paradigm, the Bidirectional Process Reward Model (BiPRM). BiPRM adds a parallel right-to-left (R2L) evaluation stream, implemented via prompt reversal, alongside the conventional L2R flow. A gating mechanism then adaptively fuses the reward scores from both streams into a holistic quality assessment. Compared to the original PRM, BiPRM introduces only a 0.3% parameter increase for the gating module, and running the two streams in parallel adds only 5% to inference latency. Extensive empirical evaluations spanning diverse benchmarks, LLM backbones, PRM objectives, and sampling policies demonstrate that BiPRM consistently surpasses unidirectional baselines, achieving an average relative gain of 10.6% across 54 solution-level configurations and 37.7% across 12 step-level error detection scenarios. Overall, our results highlight the effectiveness, robustness, and general applicability of BiPRM, offering a promising new direction for process-based reward modeling.
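
To make the two-stream idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not specify the gate's form, so the sigmoid gate over a linear combination of the two scores, the gate weights, and all function names here (`reverse_steps`, `gated_fusion`, `bidirectional_scores`) are assumptions chosen for illustration. The two scorers are passed in as plain callables standing in for the L2R and R2L reward streams.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical type: a scorer maps a list of steps to one reward per step.
StepScorer = Callable[[Sequence[str]], List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def reverse_steps(steps: Sequence[str]) -> List[str]:
    """Reverse the order of reasoning steps (a stand-in for prompt reversal)."""
    return list(reversed(steps))


def gated_fusion(r_l2r: float, r_r2l: float,
                 w_f: float = 1.0, w_b: float = 1.0, bias: float = 0.0) -> float:
    """Fuse two rewards with a scalar gate g in (0, 1).

    Assumed form: g = sigmoid(w_f * r_l2r + w_b * r_r2l + bias). The result is
    a convex combination, so it always lies between the two input rewards.
    """
    g = sigmoid(w_f * r_l2r + w_b * r_r2l + bias)
    return g * r_l2r + (1.0 - g) * r_r2l


def bidirectional_scores(steps: Sequence[str],
                         score_l2r: StepScorer,
                         score_r2l: StepScorer) -> List[float]:
    """Score each step with both streams and fuse the results per step."""
    forward = score_l2r(steps)
    # The R2L stream sees the reversed trajectory, so its scores come back in
    # reversed order and must be realigned with the original step order.
    backward = list(reversed(score_r2l(reverse_steps(steps))))
    return [gated_fusion(f, b) for f, b in zip(forward, backward)]


if __name__ == "__main__":
    steps = ["step 1", "step 2", "step 3"]
    # Toy scorers with made-up values, for illustration only.
    toy_l2r = lambda s: [0.9, 0.8, 0.3]
    toy_r2l = lambda s: [0.1, 0.7, 0.9]  # scores for the reversed order
    print(bidirectional_scores(steps, toy_l2r, toy_r2l))
```

In a real system the gate parameters would be learned and the two scorers would be PRM forward passes; the sketch only shows the data flow of reversal, realignment, and gated fusion.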
