Retrieval-Augmented Process Reward Model for Generalizable Mathematical Reasoning

Jiachen Zhu, Congmin Zheng, Jianghao Lin, Kounianhua Du, Ying Wen, Yong Yu, Jun Wang, Weinan Zhang

TL;DR

This work tackles out-of-distribution challenges in Process Reward Models for mathematical reasoning by introducing RetrievalPRM, a two-stage retrieval framework that fetches semantically similar questions and step-level references to warm up PRM prompts. The method integrates Question-level and Step-level retrieval with a retrieval-based system prompt, improving generalization across model types, sizes, and problem domains. Extensive experiments on four real-world datasets show RetrievalPRM-7B achieving state-of-the-art performance among open-source PRMs and surpassing many language-model critics, particularly on Olympiad-level problems. The authors also provide an open-source retrieval-enhanced dataset and tuning framework, highlighting practical impact for robust, explainable multi-step reasoning.

Abstract

Large language models (LLMs) have significantly advanced mathematical reasoning, and Process Reward Models (PRMs) have been developed to evaluate the logical validity of their reasoning steps. However, PRMs still struggle with out-of-distribution (OOD) challenges. This paper identifies key OOD issues, including step OOD, caused by differences in reasoning patterns across model types and sizes, and question OOD, which arises from dataset shifts between training data and real-world problems. To address these issues, we introduce the Retrieval-Augmented Process Reward Model (RetrievalPRM). Using a two-stage retrieval-enhanced mechanism, RetrievalPRM retrieves semantically similar questions and steps as a warmup, enhancing the PRM's ability to evaluate target steps and improving generalization and reasoning consistency across different models and problem types. Our extensive experiments demonstrate that RetrievalPRM outperforms existing baselines across multiple real-world datasets. Our open-source contributions include a retrieval-enhanced dataset, a tuning framework for PRM training, and the RetrievalPRM model, establishing a new standard for PRM performance.
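
The two-stage retrieval idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words `embed` function stands in for a real sentence encoder, the `pool` layout (reference question mapped to its solution steps) is hypothetical, and drawing step candidates only from the retrieved questions is an assumption made for illustration. The prompt layout is likewise a placeholder, not the paper's template.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real sentence encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, candidates, k):
    """Return the k candidates most similar to the query, best first."""
    q = embed(query)
    return sorted(candidates, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question, target_step, pool, k_questions=1, k_steps=1):
    """Assemble a retrieval-warmed prompt for scoring one target step.

    pool: dict mapping a reference question to its list of solution steps
    (a hypothetical layout chosen for this sketch).
    """
    # Stage 1: question-level retrieval over the reference questions.
    ref_questions = top_k(question, list(pool), k_questions)
    # Stage 2: step-level retrieval. Assumption: candidates are the steps
    # belonging to the questions retrieved in stage 1.
    step_candidates = [s for q in ref_questions for s in pool[q]]
    ref_steps = top_k(target_step, step_candidates, k_steps)
    lines = ["Reference questions:"] + ref_questions
    lines += ["Reference steps:"] + ref_steps
    lines += ["Target question:", question, "Target step:", target_step]
    return "\n".join(lines)


pool = {
    "Solve 2x + 3 = 7 for x": [
        "Subtract 3 from both sides to get 2x = 4",
        "Divide both sides by 2 to get x = 2",
    ],
    "What is the area of a circle with radius 3": [
        "Use the formula area = pi r squared",
        "Compute pi times 9",
    ],
}
prompt = build_prompt(
    "Solve 3x + 5 = 11 for x",
    "Subtract 5 from both sides to get 3x = 6",
    pool,
)
```

In this toy run the linear-equation reference is retrieved rather than the geometry one, and the retrieved step is the one that mirrors the target step's operation. A real system would presumably swap `embed` for a learned sentence encoder and feed the assembled prompt to the PRM.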

Paper Structure

This paper contains 33 sections, 8 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Distribution differences across three datasets: GSM8K, MATH, and Olympiad. We encode the questions with Sentence-BERT and visualize them with t-SNE.
  • Figure 2: Solution processes and problem-solving approaches for the same question vary across models of different types and sizes. GPT tends to analyze and calculate, while Qwen-72B tends to solve equations. Qwen-1.5B, being small and relatively weak, can only enumerate; its chain of thought is short, so its answers are also largely wrong.
  • Figure 3: The model structure of our proposed RetrievalPRM framework and how it differs from a traditional PRM. We design a Two-stage Retrieval Module that retrieves reference questions and reference steps in its respective stages.
  • Figure 4: F1 scores of RetrievalPRM on four datasets, and their average, as the number of retrieved questions varies. Top-0 means no questions are retrieved.
  • Figure 5: Illustration of the PRM input template.
  • ...and 1 more figures