Learning a Dense Reasoning Reward Model from Expert Demonstration via Inverse Reinforcement Learning

Claudio Fanconi; Nicolás Astorga; Mihaela van der Schaar

TL;DR

The paper reframes LLM multi-step reasoning as an inverse reinforcement learning problem and learns a dense token-level reward $r_\phi$ from expert demonstrations. The reward serves both as a training signal for the reasoning policy $\pi_\theta$ and as an inference-time critic for reranking traces under a fixed compute budget. By using an adversarial IRL setup, the method produces rewards that correlate with answer correctness rather than superficial formatting, enabling interpretable localization of errors within reasoning traces. Empirical results on GSM8K with Llama3 and Qwen2.5 backbones show that reward-guided training and reranking can yield performance competitive with supervised fine-tuning and can surpass it in some configurations, while confirming a gap to outcome-based RL upper bounds. The approach yields reusable process-level rewards that unify training, inference-time assistance, and diagnostic capabilities, with potential applicability beyond GSM8K to broader reasoning tasks.

Abstract

We reframe large language model reasoning as an adversarial inverse reinforcement learning (IRL) problem and operationalise it, learning a dense, token-level reward model for process supervision directly from expert demonstrations rather than imitating style via supervised fine-tuning. The learned reasoning reward serves two complementary roles: (i) it provides step-level feedback to optimise a reasoning policy during training; and (ii) it functions at inference as a critic to rerank sampled traces under fixed compute budgets. We demonstrate that our approach prioritises correctness over surface form, yielding scores that correlate with eventual answer validity and enabling interpretable localisation of errors within a trace. Empirically, on GSM8K with Llama3 and Qwen2.5 backbones, we demonstrate that: (i) dense reasoning rewards can be used as a learning signal to elicit reasoning, and (ii) reward-guided reranking improves predictive performance (notably for Llama-based policies). By unifying training signals, inference-time selection, and token-level diagnostics into a single reasoning reward, this work suggests reusable process-level rewards with broad potential to enhance multi-step reasoning in language models.
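
To make the inference-time role concrete, the following is a minimal sketch of reward-guided reranking and token-level error localisation. It is an illustration under stated assumptions, not the authors' implementation: the function names `rerank_traces`, `localise_lowest`, and `toy_reward` are hypothetical, the per-trace score is assumed to be the mean token reward (the abstract does not specify the aggregation), and `toy_reward` is a hand-written stand-in for a learned reward such as $r_\phi$.

```python
from typing import Callable, List, Sequence

# A token-level reward takes a trace and a position and returns a scalar.
TokenReward = Callable[[Sequence[str], int], float]


def trace_score(trace: Sequence[str], token_reward: TokenReward) -> float:
    """Mean token reward of a trace (assumed aggregation, not from the paper)."""
    if not trace:
        return float("-inf")
    return sum(token_reward(trace, t) for t in range(len(trace))) / len(trace)


def rerank_traces(traces: Sequence[Sequence[str]], token_reward: TokenReward) -> List[int]:
    """Return trace indices ordered from highest to lowest score."""
    scores = [trace_score(trace, token_reward) for trace in traces]
    return sorted(range(len(traces)), key=lambda i: scores[i], reverse=True)


def localise_lowest(trace: Sequence[str], token_reward: TokenReward) -> int:
    """Return the position of the lowest-reward token, a candidate error location."""
    return min(range(len(trace)), key=lambda t: token_reward(trace, t))


def toy_reward(trace: Sequence[str], t: int) -> float:
    # Hypothetical stand-in for a learned reward: penalise a token marked as wrong.
    return -1.0 if trace[t] == "<wrong>" else 1.0


# Two sampled candidate traces for the same prompt (toy data).
candidates = [
    ["2", "+", "2", "=", "<wrong>"],
    ["2", "+", "2", "=", "4"],
]
order = rerank_traces(candidates, toy_reward)
best = candidates[order[0]]
suspect_position = localise_lowest(candidates[0], toy_reward)
```

Under a fixed compute budget of N sampled traces, the same scoring pass supports both selecting `best` and pointing at the lowest-scoring position in a rejected trace. Other aggregations (sum, minimum, or last-token reward) are equally plausible choices here.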
