Escaping the Verifier: Learning to Reason via Demonstrations

Locke Cai, Ivan Provilkov

TL;DR

This work addresses the challenge of training reasoning-capable LLMs when task-specific verifiers are unavailable by introducing RARO, a Relativistic Adversarial Reasoning Optimization framework that learns exclusively from expert demonstrations via inverse reinforcement learning. RARO casts reasoning as an adversarial game between a shared policy and a relativistic critic that compares policy and expert outputs in pairs, enabling stable joint RL without external preferences. Across Countdown, DeepMath, and Poetry Writing, RARO outperforms verifier-free baselines and matches or approaches RLVR in verifiable settings, while also exhibiting strong test-time scaling and model-size scalability in non-verifiable domains. The approach demonstrates that robust reasoning can be elicited from demonstrations alone, potentially broadening the applicability of powerful reasoning agents to tasks lacking clear verifiers and expensive human preferences.

Abstract

Training Large Language Models (LLMs) to reason often relies on Reinforcement Learning (RL) with task-specific verifiers. However, many real-world reasoning-intensive tasks lack verifiers, despite offering abundant expert demonstrations that remain under-utilized for reasoning-focused training. We introduce RARO (Relativistic Adversarial Reasoning Optimization), which learns strong reasoning capabilities from expert demonstrations alone via Inverse Reinforcement Learning. Our method sets up an adversarial game between a policy and a relativistic critic: the policy learns to mimic expert answers, while the critic aims to identify the expert among (expert, policy) answer pairs. Both the policy and the critic are trained jointly and continuously via RL, and we identify the key stabilization techniques required for robust learning. Empirically, RARO significantly outperforms strong verifier-free baselines on all of our evaluation tasks -- Countdown, DeepMath, and Poetry Writing -- and enjoys the same robust scaling trends as RL with verifiers. These results demonstrate that our method effectively elicits strong reasoning performance from expert demonstrations alone, enabling robust reasoning learning even when task-specific verifiers are unavailable.
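The adversarial game described in the abstract can be illustrated with a minimal sketch of how rewards might be assigned for a single (expert, policy) answer pair. This is not the authors' implementation: the function names, the slot labels "A"/"B", the random ordering of the pair, and the numeric reward values (1, 0, and a tie reward of 0.5) are all assumptions chosen for illustration. The source states only that the critic is rewarded for identifying the expert, the policy is rewarded for deceiving the critic, and the critic may declare a tie.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Rewards:
    critic: float
    policy: float


def shuffle_pair(expert_answer, policy_answer, rng):
    """Present the two answers in random order so position carries no signal.

    Returns the ordered pair and the slot ("A" or "B") holding the expert answer.
    The random ordering is an assumption, not something the source specifies.
    """
    if rng.random() < 0.5:
        return (expert_answer, policy_answer), "A"
    return (policy_answer, expert_answer), "B"


def relativistic_rewards(verdict, expert_slot, tie_reward=0.5):
    """Assign rewards from the critic's verdict on one answer pair.

    verdict: "A", "B", or "tie" (the slot the critic believes is the expert).
    expert_slot: the slot that actually holds the expert answer.
    tie_reward: hypothetical value given to both players on a tie.
    """
    if verdict not in ("A", "B", "tie"):
        raise ValueError(f"unknown verdict: {verdict!r}")
    if verdict == "tie":
        # The critic abstains; both sides receive the same intermediate reward.
        return Rewards(critic=tie_reward, policy=tie_reward)
    if verdict == expert_slot:
        # The critic found the expert; the policy failed to deceive it.
        return Rewards(critic=1.0, policy=0.0)
    # The critic mistook the policy answer for the expert one.
    return Rewards(critic=0.0, policy=1.0)


if __name__ == "__main__":
    rng = random.Random(0)
    pair, expert_slot = shuffle_pair("expert answer", "policy answer", rng)
    print(pair, expert_slot, relativistic_rewards("tie", expert_slot))
```

In this sketch the tie option gives the critic a bounded, non-zero payoff when the two answers are hard to tell apart, which is one plausible reading of how a tie verdict yields stable rewards.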

Paper Structure

This paper contains 76 sections, 3 theorems, 53 equations, 21 figures, 13 tables, 3 algorithms.

Key Result

Proposition A.1

Consider the KL-regularized reward-maximization objective: The optimal policy has the following closed-form solution: where $Z_{\theta^\star(\phi)}(q)$ is the partition function ensuring normalization.
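The equations of the proposition did not survive on this page. For orientation, the standard form of this result is sketched below under assumed notation: $\pi_\theta$ is the policy, $\pi_{\mathrm{ref}}$ a reference policy, $r_\phi$ the reward induced by the critic with parameters $\phi$, $\beta > 0$ the KL coefficient, $q$ the question, and $y$ a response. These symbols are assumptions, and the paper's exact statement may differ; only the partition function $Z_{\theta^\star(\phi)}(q)$ is taken from the text above.

```latex
% KL-regularized reward maximization (standard form, assumed notation)
\max_{\theta}\;
  \mathbb{E}_{y \sim \pi_\theta(\cdot \mid q)}\big[ r_\phi(q, y) \big]
  \;-\; \beta\, \mathrm{KL}\big( \pi_\theta(\cdot \mid q) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid q) \big)

% Closed-form optimum
\pi_{\theta^\star(\phi)}(y \mid q)
  \;=\; \frac{1}{Z_{\theta^\star(\phi)}(q)}\;
        \pi_{\mathrm{ref}}(y \mid q)\,
        \exp\!\Big( \tfrac{1}{\beta}\, r_\phi(q, y) \Big),
\qquad
Z_{\theta^\star(\phi)}(q)
  \;=\; \sum_{y} \pi_{\mathrm{ref}}(y \mid q)\,
        \exp\!\Big( \tfrac{1}{\beta}\, r_\phi(q, y) \Big)
```

In this form the optimum reweights the reference policy by the exponentiated reward, and the partition function sums that weight over all responses so that the result is a valid distribution.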

Figures (21)

  • Figure 1: Overview of RARO. The method creates an adversarial game between a policy and a relativistic critic that share the same weights. The critic is rewarded for identifying the expert among (expert, policy) answer pairs, while the policy is rewarded for deceiving the critic. Additionally, the critic can declare a tie, yielding stable rewards when it is unsure. Both the policy and the critic are trained jointly and continuously via RL.
  • Figure 2: Main Countdown Results. RARO against baselines at a fixed reasoning budget of 2048 tokens.
  • Figure 3: Performance scaling. RARO consistently improves with model size (1.5B to 7B) across both DeepMath and Poetry Writing.
  • Figure 4: Stable Reward and Length Growth. The validation reward and response length of RARO on DeepMath (1.5B) grow continuously over training, indicating stable training dynamics.
  • Figure 5: Test-time Scaling (TTS) on DeepMath. Performance improves as the number of rollouts ($N$) increases for all model sizes. See Table \ref{tab:deepmath_tts} in Appendix \ref{sec:additional_tables_figures} for detailed data.
  • ...and 16 more figures

Theorems & Definitions (6)

  • Proposition A.1
  • proof
  • Proposition A.2
  • proof
  • Proposition A.3
  • proof