rePIRL: Learn PRM with Inverse RL for LLM Reasoning

Xian Wu; Kaijie Zhu; Ying Zhang; Lun Wang; Wenbo Guo

TL;DR

rePIRL reframes multi-step LLM reasoning as a token-level MDP and learns a process reward model (PRM) with minimal assumptions via an inverse RL-inspired dual learning loop. It defines an energy-based reward $r_\phi(s_t,a_t)$ with latent variables and uses importance sampling to align the PRM with expert trajectories, requiring neither token-level rewards nor access to the expert policy, while the policy is updated under a maximum-entropy objective. The framework unifies online and offline PRM methods (e.g., PRIME, MCTS, DPO, DQO) under weaker assumptions and reports strong gains on standard math and coding benchmarks, along with practical uses in test-time training, test-time scaling, and hard-problem signaling. These results suggest that PRMs learned via rePIRL can guide efficient policy optimization in LLM reasoning and enable broader, safer applications of IRL-based reward shaping.

Abstract

Process rewards have been widely used in deep reinforcement learning to improve training efficiency, reduce variance, and prevent reward hacking. In LLM reasoning, existing works also explore various solutions for learning effective process reward models (PRMs) with or without the help of an expert policy. However, existing methods either rely on strong assumptions about the expert policies (e.g., requiring their reward functions) or suffer from intrinsic limitations (e.g., entropy collapse), resulting in weak PRMs or limited generalizability. In this paper, we introduce rePIRL, an inverse RL-inspired framework that learns effective PRMs with minimal assumptions about expert policies. Specifically, we design a dual learning process that alternates between updating the policy and updating the PRM. Our learning algorithm includes customized techniques to address the challenges of scaling traditional inverse RL to LLMs. We theoretically show that our proposed learning framework unifies both online and offline PRM learning methods, justifying the claim that rePIRL can learn PRMs with minimal assumptions. Empirical evaluations on standardized math and coding reasoning datasets demonstrate the effectiveness of rePIRL over existing methods. We further show applications of our trained PRM in test-time training, test-time scaling, and providing an early signal for training on hard problems. Finally, we validate our training recipe and key design choices via a detailed ablation study.
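The PRM update described above relies on importance sampling to handle the normalizer of an energy-based reward. The summary here does not state the exact loss, so the following is only a minimal sketch under an assumption: the standard maximum-entropy IRL likelihood $p(\tau) \propto \exp(r(\tau))$, whose log-partition term is estimated from trajectories drawn from a proposal $q$ (in practice, the current policy). The function names `log_partition_is` and `irl_loss` and the toy six-trajectory setup are hypothetical and not taken from the paper.

```python
import numpy as np

def log_partition_is(rewards, log_q):
    """Importance-sampled estimate of log Z, with Z = sum_tau exp(r(tau)).

    rewards: r(tau_j) for trajectories tau_j sampled from the proposal q
    log_q:   log q(tau_j) for the same samples
    """
    log_w = np.asarray(rewards) - np.asarray(log_q)  # log importance weights
    m = log_w.max()                                  # shift for numerical stability
    return m + np.log(np.mean(np.exp(log_w - m)))

def irl_loss(expert_rewards, sample_rewards, sample_log_q):
    """Assumed loss: negative log-likelihood of expert trajectories
    under p(tau) proportional to exp(r(tau))."""
    return -np.mean(expert_rewards) + log_partition_is(sample_rewards, sample_log_q)

# Toy check on a space of 6 "trajectories" where log Z can be computed exactly.
rng = np.random.default_rng(0)
true_r = rng.normal(size=6)
exact_log_z = np.log(np.sum(np.exp(true_r)))

# Proposal 1: uniform sampling (noisy but unbiased estimate of Z).
q_uniform = np.full(6, 1.0 / 6.0)
idx = rng.choice(6, size=20000, p=q_uniform)
est_uniform = log_partition_is(true_r[idx], np.log(q_uniform[idx]))

# Proposal 2: q proportional to exp(r); every weight equals Z, so the estimate is exact.
log_q_opt = true_r - exact_log_z
idx_opt = rng.choice(6, size=50, p=np.exp(log_q_opt))
est_opt = log_partition_is(true_r[idx_opt], log_q_opt[idx_opt])
```

The second proposal illustrates why alternating policy and reward updates helps in this kind of scheme: the closer the sampling policy is to the energy-based distribution, the lower the variance of the partition estimate.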

Paper Structure (23 sections, 5 theorems, 21 equations, 3 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Key Technique
Problem Setup
rePIRL Learning Framework
Integrate SOTA Methods into our Framework
Evaluation
Experiment Setup
Main Experiments
Applications
rePIRL with PRM only
Discussion
Conclusion and Future Works
Theoretical Proof
Proof of Theorem 1
...and 8 more sections

Key Result

Theorem 1

The optimal solution of the maximum-entropy RL objective in Eqn. \eqref{final_policy} is $\pi_*(a|s) = \exp(\frac{1}{\beta}[Q^*(s,a) - V^*(s)])$, where $V^{\pi}(s_t) = \mathbb{E}_{a_t\sim\pi(\cdot|s_t)} [r(s_t,a_t)+\gamma V^{\pi}(s_{t+1})] + \beta\mathcal{H}(\pi(\cdot|s_t))$ is the soft state-value function ...
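The closed form in Theorem 1 can be checked numerically on a small tabular problem. The sketch below is illustrative and not the paper's implementation: it assumes deterministic transitions (as in a token-level MDP, where the next state is fixed by the chosen token), uses the standard soft Bellman backup $V(s) = \beta \log \sum_a \exp(Q(s,a)/\beta)$, and picks hypothetical values for `beta`, `gamma`, and the random 4-state, 3-action MDP.

```python
import numpy as np

def soft_value_iteration(r, nxt, beta=0.5, gamma=0.9, iters=500):
    """Soft value iteration on a tabular MDP with deterministic transitions.

    r[s, a]:   reward for taking action a in state s
    nxt[s, a]: index of the next state reached from (s, a)
    """
    def soft_max_value(Q):
        # V(s) = beta * log sum_a exp(Q(s, a) / beta), computed stably
        m = Q.max(axis=1)
        return m + beta * np.log(np.exp((Q - m[:, None]) / beta).sum(axis=1))

    V = np.zeros(r.shape[0])
    for _ in range(iters):
        Q = r + gamma * V[nxt]   # soft Bellman backup for Q
        V = soft_max_value(Q)
    Q = r + gamma * V[nxt]
    V = soft_max_value(Q)
    pi = np.exp((Q - V[:, None]) / beta)  # closed-form policy from Theorem 1
    return Q, V, pi

rng = np.random.default_rng(0)
r = rng.normal(size=(4, 3))
nxt = rng.integers(0, 4, size=(4, 3))
Q, V, pi = soft_value_iteration(r, nxt)

# Entropy of the resulting policy in each state.
entropy = -(pi * np.log(pi)).sum(axis=1)
```

Under these assumptions, each row of `pi` sums to one without any explicit normalization, and `V` matches the expected soft Q-value plus `beta` times the policy entropy, which is the relationship the theorem's definition of $V^{\pi}$ expresses.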

Figures (3)

Figure 1: Performance of three applications of our PRM (see "Applications") and rePIRL without outcome reward (see "rePIRL with PRM only").
Figure 2: Comparison of rePIRL using Claude-3.7-Sonnet versus DeepSeek-R1 as expert trajectory generators.
Figure 3: Ablation study comparing rePIRL with importance sampling versus baselines with only increased policy updates.

Theorems & Definitions (5)

Theorem 1
Proposition 1
Proposition 2
Proposition 3
Proposition 4
