Learning a Dense Reasoning Reward Model from Expert Demonstration via Inverse Reinforcement Learning

Claudio Fanconi, Nicolás Astorga, Mihaela van der Schaar

TL;DR

The paper reframes LLM multi-step reasoning as an inverse reinforcement learning problem and learns a dense token-level reward $r_\phi$ from expert demonstrations. The reward serves both as a training signal for the reasoning policy $\pi_\theta$ and as an inference-time critic for reranking traces under a fixed compute budget. Using an adversarial IRL setup, the method produces rewards that correlate with answer correctness rather than superficial formatting, which enables interpretable localization of errors within reasoning traces. Empirical results on GSM8K with Llama3 and Qwen2.5 backbones show that reward-guided training and reranking can yield performance competitive with supervised fine-tuning and can surpass it in some configurations, while a gap to outcome-based RL upper bounds remains. The approach proposes reusable process-level rewards that unify training, inference-time assistance, and diagnostic capabilities, with potential applicability beyond GSM8K to broader reasoning tasks.

Abstract

We reframe and operationalise adversarial inverse reinforcement learning (IRL) for large language model reasoning, learning a dense, token-level reward model for process supervision directly from expert demonstrations rather than imitating style via supervised fine-tuning. The learned reasoning reward serves two complementary roles: (i) it provides step-level feedback to optimise a reasoning policy during training; and (ii) it functions at inference as a critic to rerank sampled traces under fixed compute budgets. We demonstrate that our approach prioritises correctness over surface form, yielding scores that correlate with eventual answer validity and enabling interpretable localisation of errors within a trace. Empirically, on GSM8K with Llama3 and Qwen2.5 backbones, we demonstrate that: (i) dense reasoning rewards can be used as a learning signal to elicit reasoning, and (ii) reward-guided reranking improves predictive performance (notably for Llama-based policies). By unifying training signals, inference-time selection, and token-level diagnostics into a single reasoning reward, this work suggests reusable process-level rewards with broad potential to enhance multi-step reasoning in language models.
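
The inference-time role described in (ii) can be illustrated with a minimal sketch of reward-guided reranking over a fixed budget of sampled traces. This is not the authors' implementation: the function names (`trace_score`, `rerank`, `pass_at_k`) are hypothetical, the per-token rewards are assumed to be already computed by the learned reward model, and aggregating them by their mean is an assumption made only for illustration.

```python
from typing import List, Sequence


def trace_score(token_rewards: Sequence[float]) -> float:
    """Collapse dense token-level rewards into one scalar per trace.

    Using the mean is an assumption of this sketch, not the paper's rule.
    """
    if not token_rewards:
        return float("-inf")  # an empty trace is ranked last
    return sum(token_rewards) / len(token_rewards)


def rerank(traces: List[str], rewards: List[Sequence[float]]) -> List[str]:
    """Order sampled traces from highest to lowest aggregated reward."""
    scored = sorted(
        zip(traces, rewards),
        key=lambda pair: trace_score(pair[1]),
        reverse=True,
    )
    return [trace for trace, _ in scored]


def pass_at_k(ranked_correct: Sequence[bool], k: int) -> bool:
    """True if any of the top-k traces (after ranking) is correct."""
    return any(ranked_correct[:k])


# Toy example with three sampled traces and made-up token rewards.
traces = ["trace_a", "trace_b", "trace_c"]
rewards = [[0.1, 0.2], [0.9, 0.8], [-0.5]]
ranked = rerank(traces, rewards)
```

With a budget of N sampled traces, comparing `pass_at_k` on the reward-ranked order against a random order is one way to measure whether the critic places correct traces near the top.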

Paper Structure

This paper contains 37 sections, 8 equations, 22 figures, 5 tables, 1 algorithm.

Figures (22)

  • Figure 1: Eliciting expert reasoning via adversarial inverse reinforcement learning. The model learns a reasoning reward function from expert demonstrations using adversarial IRL.
  • Figure 2: Training behaviour of the reward and correctness for Llama3.1-8B as the policy with Llama3.2-1B as the discriminator. Left: aggregate learned reward over training steps (train/eval). Right: correctness accuracy (train/eval).
  • Figure 3: Benefit of the reasoning reward at inference for Llama3.1-8B with a Llama3.2-1B critic. Left: reward distributions for correct versus incorrect answers. Right: $\text{pass@}k \mid 16$ using reward-guided reranking versus random ranking.
  • Figure 4: Correlation of the learned reward for Llama3.1-8B with Llama3.2-1B. Shows correlations between the (discounted) learned reward on answer tokens and verifiable signals: correctness and formatting structure.
  • Figure 5: Correct answering. Dense reward on a correct solution shows contiguous positive bands on decisive computations. Rewards are standardised and discounted with $\gamma=0.9$.
  • ...and 17 more figures