Enhancing Large Language Model Reasoning with Reward Models: An Analytical Survey

Qiyuan Liu, Hao Xu, Xuhong Chen, Wei Chen, Yee Whye Teh, Ning Miao

TL;DR

Reward modeling provides scalable, automated feedback signals that enhance LLM reasoning. The survey distinguishes outcome reward models (ORMs) from process reward models (PRMs) and contrasts discriminative with generative approaches. It formalizes an RM taxonomy, reviews data construction and training methods for PRMs, and analyzes three key applications: test-time guidance, synthetic data/self-iteration, and online RL. It highlights that generative RMs often outperform discriminative ones, and that PRMs offer finer-grained feedback but introduce training and computational challenges, especially in online RL, where reward hacking remains a concern. The paper also evaluates RM benchmarks and metrics, finds that current evaluations are misaligned with downstream performance, and emphasizes the need for data-efficient PRMs, generalist RMs, and comprehensive evaluation to advance RM-based LLM reasoning.

Abstract

Reward models (RMs) play a critical role in enhancing the reasoning performance of LLMs. For example, they can provide training signals to finetune LLMs during reinforcement learning (RL) and help select the best answer from multiple candidates during inference. In this paper, we provide a systematic introduction to RMs, along with a comprehensive survey of their applications in LLM reasoning. We first review fundamental concepts of RMs, including their architectures, training methodologies, and evaluation techniques. Then, we explore their key applications: (1) guiding generation and selecting optimal outputs during LLM inference, (2) facilitating data synthesis and iterative self-improvement for LLMs, and (3) providing training signals in RL-based finetuning. Finally, we discuss critical open questions regarding the selection, generalization, evaluation, and enhancement of RMs, based on existing research and our own empirical findings. Our analysis aims to provide actionable insights for the effective deployment and advancement of RMs for LLM reasoning.
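
The inference-time use mentioned in the abstract, selecting the best answer from multiple candidates, is often called best-of-N selection. The following is a minimal sketch of that idea, not the paper's implementation. The names `best_of_n` and `toy_reward` are hypothetical, and `toy_reward` is a hand-written stand-in for a trained outcome reward model.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward score.

    `score` stands in for a reward model; ties go to the earliest candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def toy_reward(answer: str) -> float:
    # Hypothetical scorer for illustration only: rewards answers ending in "4".
    # A real RM would be a learned model, not a string check.
    return 1.0 if answer.strip().endswith("4") else 0.0


# Candidates that an LLM might sample for the prompt "What is 2 + 2?"
candidates = ["2 + 2 = 5", "2 + 2 = 4", "2 + 2 = 22"]
best = best_of_n(candidates, toy_reward)
```

In practice the scorer would call a trained RM on each (question, answer) pair. The selection logic itself stays this simple, so the quality of the result depends almost entirely on the scorer.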

Paper Structure

This paper contains 37 sections, 6 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Illustration of three main applications of reward models in LLM reasoning. Green/red blocks denote higher/lower-quality candidates or intermediate steps; “o” denotes the final output. Left: test-time guidance. (Top) Sampling and selection: the LLM samples multiple answers and the RM selects the best one. (Middle) Search: a tree of steps is expanded; the RM scores nodes to guide expansion and chooses the terminal candidate. (Bottom) Refinement: failed steps are revised until an acceptable solution is produced. Middle: synthetic data curation. The LLM first samples raw examples; the RM filters them at the response level or step level, and the accepted set is fed back for self-iteration. Right: online RL training. The LLM performs multi-step rollouts; the RM supplies outcome or process rewards, based on which the LLM is updated.
  • Figure 2: Taxonomy of current research on process reward models
  • Figure 3: Applications of RMs in LLM reasoning
  • Figure 4: Comparisons of Llama and Qwen response styles in an example math question
  • Figure 5: The relationship between correctness scores (ProcessBench), BoN scores, and search-guiding performance (MCTS and Beam) for different PRMs when used with two different policy models (math-shepherd-mistral-7b-rl [wang2024mathshepherd] and Qwen2.5-7B-Instruct [qwen2.5]) on MATH500. Points in different colors denote the six PRM variants: Math-Shepherd-PRM-7B [wang2024mathshepherd], Llama3.1-8B-PRM-Mistral-Data [xiong2024rlhflowmath], Skywork-PRM-1.5B [skyworkopeno12024], Skywork-PRM-7B [skyworkopeno12024], Qwen2.5-Math-7B-PRM800K [zheng2024processbench], and Qwen2.5-Math-PRM-7B [zhang2025prmlessons]. The trend lines represent the fitted linear regression, and the shaded areas represent the 95% confidence intervals.
  • ...and 2 more figures
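
The middle panel of Figure 1 describes filtering sampled examples at the response level or the step level. The sketch below contrasts the two under stated assumptions. All function names are hypothetical, the toy scorers are string checks rather than trained models, and requiring every step to pass the threshold is only one possible step-level rule.

```python
from typing import Callable, List, Sequence

Steps = List[str]  # one sampled solution, split into reasoning steps


def filter_response_level(
    samples: Sequence[Steps], orm: Callable[[Steps], float], threshold: float
) -> List[Steps]:
    """Keep samples whose whole-response (outcome) score passes the threshold."""
    return [s for s in samples if orm(s) >= threshold]


def filter_step_level(
    samples: Sequence[Steps], prm: Callable[[Steps, int], float], threshold: float
) -> List[Steps]:
    """Keep samples in which every step passes the threshold (assumed rule)."""
    return [s for s in samples if all(prm(s, i) >= threshold for i in range(len(s)))]


def toy_orm(steps: Steps) -> float:
    # Hypothetical outcome scorer: only checks that the final step ends in "4".
    return 1.0 if steps[-1].endswith("4") else 0.0


def toy_prm(steps: Steps, i: int) -> float:
    # Hypothetical process scorer: flawed steps are hand-marked with "(bad)".
    return 0.0 if "(bad)" in steps[i] else 1.0


samples = [
    ["2 + 2 = 4"],
    ["2 + 2 = 5 (bad)"],
    ["2 * 2 = 2 + 1 (bad)", "so 2 + 2 = 4"],  # right answer, flawed reasoning
]
kept_by_orm = filter_response_level(samples, toy_orm, 0.5)
kept_by_prm = filter_step_level(samples, toy_prm, 0.5)
```

The third sample shows the difference: the outcome-level filter accepts it because the final answer is correct, while the step-level filter rejects it because of the flawed intermediate step.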