Beyond the First Error: Process Reward Models for Reflective Mathematical Reasoning

Zhaohui Yang, Chenghua He, Xiaowen Shi, Linjing Li, Qiyue Yin, Shihong Deng, Daxin Jiang

TL;DR

This work tackles the challenge of evaluating mathematical reasoning with long chain-of-thought by introducing two reflection-aware rules, Error Propagation and Error Cessation, to capture both erroneous and corrective steps in reasoning. It deploys an LLM-based judger to annotate step-level correctness, producing 1.7 million labeled samples to train a 7B Process Reward Model (PRM) that outperforms open-source baselines and MC-based annotation methods on both solution- and step-level metrics. The PRM is trained with a binary cross-entropy loss over per-step labels, $L_{PRM} = -\sum_{i=0}^{K} \left[ \hat{y}_{i}\log y_i + (1-\hat{y}_i)\log(1-y_{i}) \right]$, with labels derived from reflection-aware judgments. Experimental results on MATH500 and AIME24 show superior performance in PRM@64 and PRM@8-step, as well as robust generalization to out-of-distribution tasks like OBen, indicating practical benefits for guided search and reflective mathematical reasoning in LLMs.
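
The per-step loss named in the TL;DR can be illustrated with a minimal sketch. It assumes that $\hat{y}_i$ is the 0/1 step label and $y_i$ is the PRM's predicted probability that step $i$ is correct, and it uses the conventional negated form of binary cross-entropy so that lower values are better. The function name `prm_bce_loss` and the `eps` clamp are illustrative choices, not taken from the paper.

```python
import math


def prm_bce_loss(labels, probs, eps=1e-12):
    """Summed binary cross-entropy over the steps of one solution.

    labels: 0/1 step labels (assumed to play the role of y-hat).
    probs:  predicted probabilities that each step is correct (assumed y).
    eps:    hypothetical clamp that keeps log() away from 0.
    """
    if len(labels) != len(probs):
        raise ValueError("labels and probs must have the same length")
    total = 0.0
    for y_hat, y in zip(labels, probs):
        y = min(max(y, eps), 1.0 - eps)  # avoid log(0)
        total -= y_hat * math.log(y) + (1.0 - y_hat) * math.log(1.0 - y)
    return total


if __name__ == "__main__":
    # Three steps: correct, incorrect, correct (e.g. after a self-correction).
    print(prm_bce_loss([1, 0, 1], [0.9, 0.2, 0.7]))
```

Confident, correct predictions drive the sum toward zero, while a maximally uncertain prediction of 0.5 contributes $\log 2$ per step.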

Abstract

Many studies focus on data annotation techniques for training effective PRMs. However, current methods encounter a significant issue when applied to long CoT reasoning processes: they tend to focus solely on the first incorrect step and all preceding steps, assuming that all subsequent steps are incorrect. These methods overlook the unique self-correction and reflection mechanisms inherent in long CoT, where correct reasoning steps may still occur after initial reasoning mistakes. To address this issue, we propose a novel data annotation method for PRMs specifically designed to score the long CoT reasoning process. Given that under the reflection pattern, correct and incorrect steps often alternate, we introduce the concepts of Error Propagation and Error Cessation, enhancing PRMs' ability to identify both effective self-correction behaviors and reasoning based on erroneous steps. Leveraging an LLM-based judger for annotation, we collect 1.7 million data samples to train a 7B PRM and evaluate it at both solution and step levels. Experimental results demonstrate that compared to existing open-source PRMs and PRMs trained on open-source datasets, our PRM achieves superior performance across various metrics, including search guidance, BoN, and F1 scores. Compared to widely used MC-based annotation methods, our annotation approach not only achieves higher data efficiency but also delivers superior performance. Detailed analysis is also conducted to demonstrate the stability and generalizability of our method.
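
The contrast between first-error labeling and reflection-aware labeling can be sketched as follows. This is one possible reading of the abstract, not the paper's annotation procedure: it assumes a judger has already decided whether each step is valid on its own (`step_ok`) and which earlier steps it builds on (`builds_on`). Both inputs and both function names are hypothetical. Under this reading, Error Propagation marks a step incorrect when it builds on an incorrect step, and Error Cessation lets a step that restarts from sound steps be marked correct again.

```python
def first_error_labels(step_ok):
    """Baseline: every step from the first error onward is labeled 0."""
    labels = []
    seen_error = False
    for ok in step_ok:
        seen_error = seen_error or not ok
        labels.append(0 if seen_error else 1)
    return labels


def reflection_aware_labels(step_ok, builds_on):
    """Illustrative reading of Error Propagation / Error Cessation.

    step_ok[i]:   judger says step i is valid in isolation (assumed input).
    builds_on[i]: indices of earlier steps that step i relies on (assumed input).
    """
    labels = []
    for i, ok in enumerate(step_ok):
        if any(j >= i for j in builds_on[i]):
            raise ValueError("a step may only build on earlier steps")
        # Propagation: relying on a step already labeled 0 taints this step.
        inherits_error = any(labels[j] == 0 for j in builds_on[i])
        # Cessation: a valid step that relies only on sound steps is correct.
        labels.append(1 if ok and not inherits_error else 0)
    return labels


if __name__ == "__main__":
    # Step 1 is wrong, step 2 builds on it, step 3 reflects and restarts from step 0.
    step_ok = [True, False, True, True, True]
    builds_on = [[], [0], [1], [0], [3]]
    print(first_error_labels(step_ok))
    print(reflection_aware_labels(step_ok, builds_on))
```

In this toy trace, the baseline discards the two steps that follow the self-correction, while the reflection-aware labels keep them as positive training signal.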

Paper Structure

This paper contains 41 sections, 1 equation, 10 figures, 11 tables.

Figures (10)

  • Figure 1: The overall framework of our method.
  • Figure 2: PRM@N of Qwen2.5-7B-SFT$^{*}$ using PRMs trained on data annotated by MC-based and our method.
  • Figure 3: We categorize 1,000 solutions into 10 equal-sized bins based on their step counts, with Bin 1 containing solutions with the fewest steps and Bin 10 containing those with the most steps. Within each bin, we calculate the proportion of steps where both completion models assign identical hard labels.
  • Figure 4: Distribution of the number of steps in each solution and the number of tokens contained in each step across different datasets. We randomly select 1,000 samples from each dataset for statistical analysis.
  • Figure 5: An Example of solution reformation (part 1).
  • ...and 5 more figures
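
The binning analysis described in the Figure 3 caption can be sketched roughly as below. The input format (one pair of hard-label lists per solution, one list per completion model) and the function name are assumptions made for illustration. When the number of solutions is not divisible by the number of bins, this sketch puts the remainder in the last bin, which the caption does not specify.

```python
def agreement_by_step_count_bin(label_pairs, num_bins=10):
    """Per-bin proportion of steps on which two completion models agree.

    label_pairs: list of (labels_a, labels_b), the hard labels each
                 completion model assigned to the steps of one solution
                 (assumed format).
    Returns a list of length num_bins; index 0 holds the shortest solutions.
    """
    ordered = sorted(label_pairs, key=lambda pair: len(pair[0]))
    bin_size = len(ordered) // num_bins
    if bin_size == 0:
        raise ValueError("need at least num_bins solutions")
    rates = []
    for b in range(num_bins):
        start = b * bin_size
        # The last bin absorbs any remainder (an assumption of this sketch).
        end = start + bin_size if b < num_bins - 1 else len(ordered)
        agree = total = 0
        for labels_a, labels_b in ordered[start:end]:
            agree += sum(x == y for x, y in zip(labels_a, labels_b))
            total += len(labels_a)
        rates.append(agree / total if total else float("nan"))
    return rates


if __name__ == "__main__":
    # Toy data: the models agree on short solutions and disagree on long ones.
    pairs = [([1] * n, [1] * n if n <= 10 else [0] * n) for n in range(1, 21)]
    print(agreement_by_step_count_bin(pairs))
```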