
RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents

Zijing Zhang, Ziyang Chen, Mingxiao Li, Zhaopeng Tu, Xiaolong Li

TL;DR

This work identifies inefficient exploration as a key limitation of outcome-only RL for long-horizon tasks. It introduces RLVMR, a framework that augments RL with dense, verifiable meta-reasoning rewards for planning, exploration, and reflection, together with a cold-start SFT phase and a GRPO-based optimization method (GRPO-MR). The approach yields state-of-the-art results on ALFWorld and ScienceWorld across model sizes, with strong generalization to unseen tasks and markedly higher exploration efficiency. The findings indicate that supervising the reasoning process itself, rather than relying solely on final outcomes, produces more robust, interpretable, and data-efficient agents.

Abstract

The development of autonomous agents for complex, long-horizon tasks is a central goal in AI. However, dominant training paradigms face a critical limitation: reinforcement learning (RL) methods that optimize solely for final task success often reinforce flawed or inefficient reasoning paths, a problem we term inefficient exploration. This leads to agents that are brittle and fail to generalize, as they learn to find solutions without learning how to reason coherently. To address this, we introduce RLVMR, a novel framework that integrates dense, process-level supervision into end-to-end RL by rewarding verifiable, meta-reasoning behaviors. RLVMR equips an agent to explicitly tag its cognitive steps, such as planning, exploration, and reflection, and provides programmatic, rule-based rewards for actions that contribute to effective problem-solving. These process-centric rewards are combined with the final outcome signal and optimized using a critic-free policy gradient method. On the challenging ALFWorld and ScienceWorld benchmarks, RLVMR achieves new state-of-the-art results, with our 7B model reaching an 83.6% success rate on the most difficult unseen task split. Our analysis confirms these gains stem from improved reasoning quality, including significant reductions in redundant actions and enhanced error recovery, leading to more robust, efficient, and interpretable agents.
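
The abstract describes combining rule-based process rewards with the final outcome signal and optimizing with a critic-free policy gradient method. The following is a minimal sketch of that idea under stated assumptions, not the authors' implementation: the tag syntax (`<planning>`, `<exploration>`, `<reflection>`), the flat per-step bonus, the weights, and all function names are hypothetical, and the group-relative normalization follows the general GRPO recipe (reward minus group mean, divided by group standard deviation) rather than the specific GRPO-MR formulation, which this excerpt does not give.

```python
import re
from statistics import mean, pstdev

# Hypothetical tag syntax; the source only says steps are tagged as
# planning, exploration, and reflection.
TAG_PATTERN = re.compile(r"<(planning|exploration|reflection)>")


def step_tag(step_text):
    """Return the meta-reasoning tag that opens a step, or None if absent."""
    match = TAG_PATTERN.match(step_text.strip())
    return match.group(1) if match else None


def process_reward(steps, bonus=0.1):
    """Toy rule-based reward: a flat bonus for each step with a recognised tag.

    This rule is an assumption for illustration; the paper's actual
    reward rules are not specified in this excerpt.
    """
    return sum(bonus for step in steps if step_tag(step) is not None)


def trajectory_reward(steps, success, outcome_weight=1.0):
    """Combine the final outcome signal with the process-level reward."""
    return outcome_weight * float(success) + process_reward(steps)


def group_relative_advantages(rewards, eps=1e-8):
    """Critic-free advantages: normalise rewards within a group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two rollouts sampled for the same task (made-up example data).
rollouts = [
    (["<planning> find the mug", "<exploration> open cabinet 1"], True),
    (["go to cabinet 1", "go to cabinet 1"], False),
]
rewards = [trajectory_reward(steps, ok) for steps, ok in rollouts]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the other rollouts in the same group, no learned value function (critic) is needed; a rollout is reinforced only to the extent that it scores above its siblings.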

Paper Structure

This paper contains 46 sections, 5 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Reinforcement learning with outcome-only rewards (e.g., GRPO) improves performance over vanilla models but fosters inefficient exploration, characterized by high rates of repetitive actions that hinder generalization to unseen tasks. In contrast, our proposed RLVMR significantly improves success rates and generalization by directly mitigating this inefficient exploration.
  • Figure 2: Comparison of LLM agent RL training paradigms: (a) Standard RL with outcome-only rewards (e.g., GRPO) inadvertently reinforces trajectories with inefficient or illogical intermediate reasoning steps. (b) Our RLVMR approach provides dense, verifiable rewards for beneficial meta-reasoning behaviors (e.g., T1-T4), directly shaping a more robust and coherent reasoning process.
  • Figure 3: Performance of SFT and GRPO on ALFWorld. SFT excels on seen tasks (L0) but fails to generalize, whereas GRPO generalizes better at the cost of significant inefficiency (high action counts and redundancy). This highlights a fundamental trade-off between brittle efficiency and inefficient generalization.
  • Figure 4: A schematic diagram of the RLVMR framework, which consists of two training phases: cold start and reinforcement learning. Our method provides rule-verifiable feedback signals based on the final outcome and the relative advantages of different types of meta-reasoning behaviors.
  • Figure 5: Exploration efficiency of RLVMR compared to SFT and GRPO baselines on ALFWorld. RLVMR consistently and significantly reduces both invalid and repetitive actions across all generalization levels and model sizes, demonstrating its effectiveness at mitigating inefficient exploration.
  • ...and 1 more figure
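
The captions for Figures 1, 3, and 5 refer to rates of invalid and repetitive actions as measures of exploration efficiency. As a rough illustration of how such rates could be computed from a logged episode, here is a small sketch. The definitions are assumptions, not the paper's: an action counts as repetitive if the identical action string was already issued earlier in the same episode, and validity is taken from a per-step flag that the caller supplies. The function name `exploration_stats` is hypothetical.

```python
def exploration_stats(actions, valid_flags):
    """Compute the fraction of repetitive and invalid actions in one episode.

    actions: list of action strings in the order they were issued.
    valid_flags: list of booleans, True if the environment accepted the action.
    Both definitions are illustrative assumptions.
    """
    if len(actions) != len(valid_flags):
        raise ValueError("actions and valid_flags must have the same length")
    total = len(actions)
    if total == 0:
        return {"repetitive_rate": 0.0, "invalid_rate": 0.0}

    seen = set()
    repeats = 0
    for action in actions:
        if action in seen:
            repeats += 1  # identical action already issued in this episode
        seen.add(action)

    invalid = sum(1 for ok in valid_flags if not ok)
    return {
        "repetitive_rate": repeats / total,
        "invalid_rate": invalid / total,
    }


# Made-up episode log for illustration.
episode_actions = ["go to cabinet 1", "open cabinet 1", "go to cabinet 1", "take mug 1"]
episode_valid = [True, True, True, False]
stats = exploration_stats(episode_actions, episode_valid)
```

Exact string matching is a deliberately crude notion of repetition; revisiting a location can be legitimate in some tasks, so a stricter definition might also condition on the observed state.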