Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS

Can Jin, Yang Zhou, Qixin Zhang, Hongwu Peng, Di Zhang, Marco Pavone, Ligong Han, Zhang-Wei Hong, Tong Che, Dimitris N. Metaxas

TL;DR

AIRL-S addresses the fragmentation between reinforcement-learning-based and search-based test-time scaling for large language models by learning a dense, step-wise process reward model (PRM) via adversarial inverse reinforcement learning and guiding policy optimization with GRPO. The learned PRM serves as both a critic during RL training and a high-quality heuristic for search-time reasoning, enabling robust chain-of-thought extensions and mitigating reward hacking. Empirical results across eight mathematics, science, and coding benchmarks show a roughly 9% average accuracy improvement over the base model, matching GPT-4o, and superior PRM-guided search across multiple TTS methods with reduced dependence on labeled data. The work demonstrates that the reward function learned during RL is effectively the best PRM for search, offering a cost-efficient, generalizable approach to complex reasoning in LLMs with strong practical impact for scalable reasoning and debugging of AI systems.

Abstract

Test-time scaling (TTS) for large language models (LLMs) has thus far fallen into two largely separate paradigms: (1) reinforcement learning (RL) methods that optimize sparse outcome-based rewards, yet suffer from instability and low sample efficiency; and (2) search-based techniques guided by independently trained, static process reward models (PRMs), which require expensive human- or LLM-generated labels and often degrade under distribution shifts. In this paper, we introduce AIRL-S, the first natural unification of RL-based and search-based TTS. Central to AIRL-S is the insight that the reward function learned during RL training inherently represents the ideal PRM for guiding downstream search. Specifically, we leverage adversarial inverse reinforcement learning (AIRL) combined with group relative policy optimization (GRPO) to learn a dense, dynamic PRM directly from correct reasoning traces, entirely eliminating the need for labeled intermediate process data. At inference, the resulting PRM simultaneously serves as the critic for RL rollouts and as a heuristic to effectively guide search procedures, facilitating robust reasoning chain extension, mitigating reward hacking, and enhancing cross-task generalization. Experimental results across eight benchmarks, including mathematics, scientific reasoning, and code generation, demonstrate that our unified approach improves performance by 9% on average over the base model, matching GPT-4o. Furthermore, when integrated into multiple search algorithms, our PRM consistently outperforms all baseline PRMs trained with labeled data. These results underscore that, indeed, your reward function for RL is your best PRM for search, providing a robust and cost-effective solution to complex reasoning tasks in LLMs.
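
To make the two ingredients named in the abstract concrete, the sketch below shows the standard form of each in isolation. It is an illustration under stated assumptions, not the authors' implementation. The AIRL part follows the original AIRL discriminator form, D = exp(f) / (exp(f) + pi), whose reward log D - log(1 - D) reduces to f - log pi. The GRPO part normalizes each rollout's reward against the other rollouts sampled for the same prompt. The function names, the `eps` constant, and the use of the population standard deviation are assumptions made for this example; how AIRL-S combines the dense and outcome rewards is not specified here and is not modeled.

```python
import math
from statistics import mean, pstdev


def airl_discriminator(f_value, log_pi):
    """D = exp(f) / (exp(f) + pi), written as sigmoid(f - log pi).

    f_value: scalar output of the learned reward network for one step
             (hypothetical input for this sketch).
    log_pi:  log-probability the policy assigns to that step.
    """
    return 1.0 / (1.0 + math.exp(-(f_value - log_pi)))


def airl_reward(f_value, log_pi):
    """Dense reward log D - log(1 - D); algebraically equal to f - log pi."""
    d = airl_discriminator(f_value, log_pi)
    return math.log(d) - math.log(1.0 - d)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampling group.

    Assumption: population standard deviation; implementations differ on
    this detail. eps avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four rollouts for one prompt with binary outcome rewards.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # Dense step reward for an illustrative (f, log pi) pair.
    print(airl_reward(0.3, -1.2))
```

Because the advantages are computed relative to the group, a group in which every rollout receives the same reward yields zero advantage for all of them, which is one way to see why sparse outcome rewards alone can give a weak training signal.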

Paper Structure

This paper contains 47 sections, 10 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: Overview of AIRL-S. During training, AIRL-S uses the AIRL discriminator to learn a PRM and optimizes the policy with both dense rewards from AIRL and outcome rewards from GRPO. At test time, the trained policy and PRM jointly guide downstream search algorithms.
  • Figure 2: Average performance of four PRMs applied to four generative LLMs using Best-of-N with 64 rollouts on AIME2024, AMC, and MATH500. Our PRM (Qwen2.5-AIRL-S-PRM) consistently delivers the highest test-time search performance across all models and datasets.
  • Figure 3: Comparison of test-time search performance with our PRM applied to MCTS, Beam Search, and Best-of-N across varying rollout counts. Our PRM consistently improves performance for all search techniques.
  • Figure 4: Comparison of AIRL-S and GRPO trained with outcome rewards only. AIRL-S improves training and validation performance and enables longer response generation at test time.
  • Figure 5: Performance of different PRM aggregation techniques evaluated on MATH500.
  • ...and 3 more figures
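
The captions for Figures 2 and 5 refer to Best-of-N search and to PRM aggregation techniques. As a minimal sketch of how those two pieces usually fit together, the example below collapses each candidate's per-step PRM scores into a single number and returns the highest-scoring candidate. The four aggregation rules shown (`min`, `last`, `mean`, `prod`) are common choices assumed for illustration; this section does not say which ones the paper evaluates. The candidate answers and scores are made up, and `best_of_n` is a hypothetical helper name.

```python
import math

# Common ways to reduce a list of step-level scores to one solution-level
# score. Assumed for illustration; not taken from the paper.
AGGREGATORS = {
    "min": min,                              # weakest step decides
    "last": lambda scores: scores[-1],       # score of the final step only
    "mean": lambda scores: sum(scores) / len(scores),
    "prod": math.prod,                       # all steps must score well
}


def best_of_n(candidates, method="min"):
    """Return the answer whose aggregated PRM score is highest.

    candidates: list of (answer, step_scores) pairs, where step_scores is a
                non-empty list of per-step PRM scores in [0, 1].
    method:     key into AGGREGATORS.
    """
    aggregate = AGGREGATORS[method]
    best_answer, _ = max(candidates, key=lambda c: aggregate(c[1]))
    return best_answer


if __name__ == "__main__":
    # Made-up rollouts for one problem: (final answer, per-step scores).
    rollouts = [
        ("A", [0.90, 0.20, 0.95]),
        ("B", [0.70, 0.60, 0.65]),
        ("C", [0.50, 0.50, 0.99]),
    ]
    for name in AGGREGATORS:
        print(name, best_of_n(rollouts, method=name))
```

On this toy input the choice of aggregation changes which rollout is selected, which is why the aggregation rule is worth evaluating as its own design decision.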