Reinforced Embodied Planning with Verifiable Reward for Real-World Robotic Manipulation

Zitong Bo, Yue Hu, Jinming Ma, Mingliang Zhou, Junhui Yin, Yachen Kang, Yuqi Liu, Tong Wu, Diyun Xiang, Hao Chen

TL;DR

REVER turns vision-language models into reliable long-horizon planners for real-world manipulation guided by natural language. It introduces a verifiable reward with format and content components and uses GRPO (Group Relative Policy Optimization) for reinforcement fine-tuning, producing executable, grammar-consistent plans that a monitor can verify during execution. The large-scale, real-world LEAP dataset is built from Universal Manipulation Interface (UMI) kinesthetic demonstrations, enabling open-source training and evaluation. RoboFarseer, a 7B model, achieves planning performance competitive with larger models and raises real-world task success by roughly 60% when integrated into a hierarchical planning-and-control system. By providing interpretable chain-of-thought plans, a closed-loop verification mechanism, and open datasets and code, the work advances practical embodied AI for open-world manipulation.
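The reward structure summarized above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `<think>`/`<answer>` template, the weight `w_format`, and all function names are assumptions introduced here. The only parts taken as given are that the reward has a format term and a content term, and that GRPO standardizes rewards within a group of responses sampled for the same prompt.

```python
import re
from statistics import mean, pstdev

# Hypothetical output template. The summary only states that the reward
# has a format component; the exact grammar is not given there.
_TEMPLATE = re.compile(r"^<think>.+?</think>\s*<answer>.+?</answer>$", re.DOTALL)


def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the assumed template, else 0.0."""
    return 1.0 if _TEMPLATE.match(response.strip()) else 0.0


def total_reward(response: str, content_score: float, w_format: float = 0.5) -> float:
    """Weighted sum of the format and content terms (weight is illustrative)."""
    return w_format * format_reward(response) + (1.0 - w_format) * content_score


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the advantages are computed relative to the group mean, a group in which every sampled plan earns the same reward contributes no learning signal, which is one reason a graded content score is more useful here than a binary pass/fail.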

Abstract

Enabling robots to execute long-horizon manipulation tasks from free-form language instructions remains a fundamental challenge in embodied AI. While vision-language models (VLMs) have shown promise as high-level planners, their deployment in the real world is hindered by two gaps: (i) the scarcity of large-scale, sequential manipulation data that couples natural language with multi-step action plans, and (ii) the absence of dense, interpretable rewards for fine-tuning VLMs on planning objectives. To address these issues, we propose REVER, a framework that empowers VLMs to generate and validate long-horizon manipulation plans from natural language instructions in real-world scenarios. Under REVER we train and release RoboFarseer, a VLM incentivized to emit chain-of-thought traces that perform temporal and spatial reasoning, ensuring physically plausible and logically coherent plans. To obtain training data, we leverage the Universal Manipulation Interface framework to capture hardware-agnostic demonstrations of atomic skills. An automated annotation engine converts each demonstration into a vision-instruction-plan triplet. We introduce a verifiable reward that scores the generated plan by its ordered bipartite matching overlap with the ground-truth skill sequence. At run time, the fine-tuned VLM functions both as a planner and as a monitor, verifying step-wise completion. RoboFarseer matches or exceeds the performance of proprietary models that are orders of magnitude larger, and on open-ended planning it surpasses the best baseline by more than 40%. In real-world, long-horizon tasks, the complete system boosts overall success by roughly 60% compared with the same low-level controller without the planner. We will open-source both the dataset and the trained model upon publication.
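To make the idea of an order-aware overlap score concrete, the sketch below scores a predicted plan against a ground-truth skill sequence. The abstract does not define the matching procedure, so this uses a longest-common-subsequence (LCS) match as a stand-in: each predicted step is paired with at most one reference step, and pairs may not cross. The function name, the exact-string comparison of steps, and the normalization by the longer sequence are all assumptions made for illustration.

```python
def ordered_match_score(predicted, reference):
    """Fraction of steps matched in order, using an LCS stand-in.

    Steps are compared by exact equality; a real scorer would likely
    compare skill and object fields separately (assumption).
    """
    if not predicted or not reference:
        return 0.0
    m, n = len(predicted), len(reference)
    # dp[i][j] holds the LCS length of predicted[:i] and reference[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if predicted[i - 1] == reference[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Normalizing by the longer sequence penalizes both missing and extra steps.
    return dp[m][n] / max(m, n)


# Hypothetical skill strings, used only to exercise the function.
reference_plan = ["pick cup", "place cup on shelf", "close drawer"]
swapped_plan = ["pick cup", "close drawer", "place cup on shelf"]
score = ordered_match_score(swapped_plan, reference_plan)
```

With these inputs only two of the three steps can be matched without crossing, so the swapped plan scores lower than an exact match even though it contains the same set of steps. That partial credit is what makes such a score denser than a binary success signal.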

Paper Structure

This paper contains 29 sections, 7 equations, 8 figures, 1 table, 1 algorithm.

Figures (8)

  • Figure 1: Overview of RoboFarseer.
  • Figure 2: An overview of the REVER framework.
  • Figure 3: An example execution trajectory of RoboFarseer.
  • Figure 4: Planning accuracy (%) on planning benchmarks.
  • Figure 5: Planning score with bipartite matching on LEAP open-planning test set.
  • ...and 3 more figures