Enhancing Large Language Model Reasoning with Reward Models: An Analytical Survey
Qiyuan Liu, Hao Xu, Xuhong Chen, Wei Chen, Yee Whye Teh, Ning Miao
TL;DR
Reward modeling provides scalable, automated feedback signals for enhancing LLM reasoning. The survey distinguishes outcome reward models (ORMs) from process reward models (PRMs) and contrasts discriminative with generative approaches. It formalizes an RM taxonomy, reviews data construction and training methods for PRMs, and analyzes three key applications: test-time guidance, synthetic data generation with self-iteration, and online RL. It highlights that generative RMs often outperform discriminative ones, and that PRMs offer finer-grained feedback but introduce training and computational challenges, especially in online RL, where reward hacking remains a concern. The paper also evaluates RM benchmarks and metrics, finds that current evaluations are misaligned with downstream performance, and emphasizes the need for data-efficient PRMs, generalist RMs, and comprehensive evaluation to advance RM-based LLM reasoning.
Abstract
Reward models (RMs) play a critical role in enhancing the reasoning performance of LLMs. For example, they can provide training signals to finetune LLMs during reinforcement learning (RL) and help select the best answer from multiple candidates during inference. In this paper, we provide a systematic introduction to RMs, along with a comprehensive survey of their applications in LLM reasoning. We first review fundamental concepts of RMs, including their architectures, training methodologies, and evaluation techniques. Then, we explore their key applications: (1) guiding generation and selecting optimal outputs during LLM inference, (2) facilitating data synthesis and iterative self-improvement for LLMs, and (3) providing training signals in RL-based finetuning. Finally, we discuss critical open questions regarding the selection, generalization, evaluation, and enhancement of RMs, based on existing research and our own empirical findings. Our analysis aims to provide actionable insights for the effective deployment and advancement of RMs for LLM reasoning.
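As a rough illustration of the inference-time use mentioned above (selecting the best answer from multiple candidates), the following is a minimal sketch, not the survey's implementation. The names `best_of_n`, `process_score`, and `toy_reward` are hypothetical, and `toy_reward` is a stand-in for a trained RM. Taking the minimum over step rewards is only one possible way to aggregate process-level scores and is assumed here for illustration.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str],
              outcome_reward: Callable[[str], float]) -> str:
    """Return the candidate with the highest outcome-level reward (ORM-style)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    # max() keeps the first candidate on ties.
    return max(candidates, key=outcome_reward)


def process_score(step_rewards: Sequence[float]) -> float:
    """Collapse per-step rewards (PRM-style) into one score.

    Assumption: the weakest step determines the score (min aggregation).
    """
    if not step_rewards:
        raise ValueError("need at least one step reward")
    return min(step_rewards)


def toy_reward(answer: str) -> float:
    """Hypothetical stand-in for a trained reward model."""
    return 1.0 if answer.strip().endswith("42") else 0.0


if __name__ == "__main__":
    samples = ["x = 41", "x = 42", "x = 40"]
    print(best_of_n(samples, toy_reward))
    print(process_score([0.9, 0.2, 0.8]))
```

In practice the reward function would wrap a model call, and the candidates would be sampled from the LLM being evaluated; both are outside the scope of this sketch.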
