Learning a Dense Reasoning Reward Model from Expert Demonstration via Inverse Reinforcement Learning
Claudio Fanconi, Nicolás Astorga, Mihaela van der Schaar
TL;DR
The paper reframes LLM multi-step reasoning as an inverse reinforcement learning problem and learns a dense token-level reward $r_\phi$ from expert demonstrations. The reward serves both as a training signal for the reasoning policy $\pi_\theta$ and as an inference-time critic for reranking traces under a fixed compute budget. Using an adversarial IRL setup, the method produces rewards that correlate with answer correctness rather than superficial formatting, enabling interpretable localization of errors within reasoning traces. Empirical results on GSM8K with Llama3 and Qwen2.5 backbones show that reward-guided training and reranking can yield performance competitive with supervised fine-tuning and can surpass it in some configurations, while a gap to outcome-based RL upper bounds remains. The resulting process-level rewards are reusable and unify training, inference-time assistance, and diagnostic capabilities, with potential applicability beyond GSM8K to broader reasoning tasks.
Abstract
We reframe and operationalise adversarial inverse reinforcement learning (IRL) for large language model reasoning, learning a dense, token-level reward model for process supervision directly from expert demonstrations rather than imitating style via supervised fine-tuning. The learned reasoning reward serves two complementary roles: (i) it provides step-level feedback to optimise a reasoning policy during training; and (ii) it functions at inference as a critic to rerank sampled traces under fixed compute budgets. We demonstrate that our approach prioritises correctness over surface form, yielding scores that correlate with eventual answer validity and enabling interpretable localisation of errors within a trace. Empirically, on GSM8K with Llama3 and Qwen2.5 backbones, we show that: (i) dense reasoning rewards can be used as a learning signal to elicit reasoning, and (ii) reward-guided reranking improves predictive performance (notably for Llama-based policies). By unifying training signals, inference-time selection, and token-level diagnostics into a single reasoning reward, this work suggests reusable process-level rewards with broad potential to enhance multi-step reasoning in language models.
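To illustrate the inference-time role of a dense reward, the following minimal sketch reranks sampled traces by an aggregate of their per-token rewards and flags the lowest-reward token as a candidate error location. It is an illustration under stated assumptions, not the paper's implementation: the function names (`trace_score`, `rerank`, `lowest_reward_index`), the choice of the mean as aggregator, and the toy reward values are all hypothetical, and the per-token rewards are assumed to have been computed already by a learned reward model.

```python
from typing import List, Sequence, Tuple

# A candidate is (trace_text, per_token_rewards). The rewards are assumed to
# come from a learned dense reward model; here they are hard-coded toy values.
Candidate = Tuple[str, List[float]]


def trace_score(token_rewards: Sequence[float]) -> float:
    """Aggregate token-level rewards into one score.

    Using the mean is an assumption made for this sketch; other aggregators
    (sum, minimum, last-token reward) are equally plausible.
    """
    if not token_rewards:
        return float("-inf")  # an empty trace can never be selected over a real one
    return sum(token_rewards) / len(token_rewards)


def rerank(candidates: List[Candidate]) -> Candidate:
    """Return the candidate whose aggregated reward is highest (best-of-N)."""
    return max(candidates, key=lambda c: trace_score(c[1]))


def lowest_reward_index(token_rewards: Sequence[float]) -> int:
    """Index of the lowest-reward token, a crude pointer to a likely error."""
    return min(range(len(token_rewards)), key=lambda i: token_rewards[i])


# Toy data: two sampled traces for the same question (hypothetical values).
candidates: List[Candidate] = [
    ("trace A", [0.9, 0.8, 0.7]),
    ("trace B", [0.9, -0.5, 0.1]),
]

best_text, best_rewards = rerank(candidates)
suspect_position = lowest_reward_index(candidates[1][1])
```

With a fixed budget of N sampled traces, the selection step costs one reward-model pass per trace, and the same per-token scores can be reused for diagnostics without extra computation.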
