RoboScape-R: Unified Reward-Observation World Models for Generalizable Robotics Training via RL

Yinzhou Tang, Yu Shang, Yinuo Chen, Bingwen Wei, Xin Zhang, Shu'ang Yu, Liangzhi Shi, Chao Yu, Chen Gao, Wei Wu, Yong Li

TL;DR

RoboScape-R demonstrates that a dual-world-model framework can serve as a unified, online RL environment for training generalizable embodied policies. The approach derives an endogenous general reward from the world model, combining LPIPS-based goal alignment with a done signal, which enables stable cross-scenario policy optimization without handcrafted rewards. Empirical results on ManiSkill show substantial out-of-domain generalization gains (an average 37.5% improvement) and balanced multi-task performance when using the world-model reward, outperforming exogenous-reward baselines. The work highlights the potential of transformer-based world models as online training environments and points to future work on longer-horizon tasks and real-world applicability.
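To make the reward idea above concrete, here is a minimal sketch of a goal-alignment reward combined with a done signal. It is an illustration under stated assumptions, not the paper's formula: the additive combination, the `done_bonus` parameter, and the function names are hypothetical, and a simple mean absolute pixel difference stands in for LPIPS, which requires a pretrained network.

```python
from typing import Callable, Sequence

# Toy image type: flattened pixel intensities in [0, 1].
Image = Sequence[float]


def mean_abs_distance(a: Image, b: Image) -> float:
    """Toy stand-in for a perceptual distance such as LPIPS (assumption)."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def general_reward(
    obs: Image,
    goal: Image,
    done: bool,
    distance: Callable[[Image, Image], float] = mean_abs_distance,
    done_bonus: float = 1.0,  # hypothetical weight, not taken from the paper
) -> float:
    """Hypothetical combination: goal alignment plus a bonus when done."""
    alignment = -distance(obs, goal)  # closer to the goal -> higher reward
    return alignment + (done_bonus if done else 0.0)


if __name__ == "__main__":
    goal = [0.0, 0.0]
    print(general_reward([1.0, 0.0], goal, done=False))
    print(general_reward([0.0, 0.0], goal, done=True))
```

Because the reward depends only on the current observation, a goal observation, and a done flag, the same function can in principle be reused across tasks and scenes, which is the property the TL;DR attributes to the endogenous reward.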

Abstract

Achieving generalizable embodied policies remains a key challenge. Traditional policy learning paradigms, including both Imitation Learning (IL) and Reinforcement Learning (RL), struggle to cultivate generalizability across diverse scenarios. While IL policies often overfit to specific expert trajectories, RL suffers from the inherent lack of a unified and general reward signal necessary for effective multi-scene generalization. We posit that the world model is uniquely capable of serving as a universal environment proxy to address this limitation. However, current world models primarily focus on their ability to predict observations and still rely on task-specific, handcrafted reward functions, thereby failing to provide a truly general training environment. To address this problem, we propose RoboScape-R, a framework leveraging the world model to serve as a versatile, general-purpose proxy for the embodied environment within the RL paradigm. We introduce a novel world model-based general reward mechanism that generates "endogenous" rewards derived from the model's intrinsic understanding of real-world state transition dynamics. Extensive experiments demonstrate that RoboScape-R effectively addresses the limitations of traditional RL methods by providing an efficient and general training environment that substantially enhances the generalization capability of embodied policies. Our approach offers critical insights into utilizing the world model as an online training strategy and achieves an average 37.5% performance improvement over baselines under out-of-domain scenarios.

Paper Structure

This paper contains 32 sections, 10 equations, 9 figures, 2 tables, 1 algorithm.

Figures (9)

  • Figure 1: Overall structure of the proposed RoboScape-R pipeline. It mainly consists of a World Model-based Environment with General Reward and Policy Optimization. We design a world model-based general reward to train the policy in multiple environments. The world model environment has a dual-world-model structure, in which the action world model receives the action and provides the predicted observation, while the text world model provides the reward signal with a generated goal observation. This paradigm allows the policy to interact with multiple environments, yielding a generalizable policy.
  • Figure 2: Success rate of policies trained with various rewards. We evaluate the success rate (SR) of policies trained with different reward modules before they converge.
  • Figure 3: Reward curve of a successful trajectory in the out-of-domain environment for the pick&place task. It indicates that the world model-based reward is more generalizable than embedding-based and proxy-based rewards.
  • Figure 4: Visualized cases of evaluation trajectories in the in-domain evaluations. The policy trained in the world model environment with general reward is comparable to those trained in physics simulators, while policies trained with proxy-based and embedding-based rewards exhibit inferior performance.
  • Figure 5: Visualized cases of evaluation trajectories in the out-of-domain evaluations. Only the policy trained in the world model environment generalizes to OOD scenarios, while policies trained in other environments generalize poorly since they only interact with a single environment.
  • ...and 4 more figures
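
The dual-world-model structure described in the Figure 1 caption can be sketched as a small environment loop. This is a toy illustration under assumptions, not the authors' implementation: the class `DualWorldModelEnv`, its `reset`/`step` interface, and the callables passed in are hypothetical names, and the two "world models" are replaced by trivial arithmetic stand-ins so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Obs = List[float]  # toy observation: a flat list of floats


@dataclass
class DualWorldModelEnv:
    """Hypothetical wrapper pairing an action model with a text model."""

    predict_next: Callable[[Obs, float], Obs]  # action world model stand-in
    generate_goal: Callable[[str, Obs], Obs]   # text world model stand-in
    reward_fn: Callable[[Obs, Obs], Tuple[float, bool]]
    obs: Obs = field(default_factory=list, init=False)
    goal: Obs = field(default_factory=list, init=False)

    def reset(self, init_obs: Obs, instruction: str) -> Obs:
        self.obs = list(init_obs)
        # The text model produces a goal observation from the instruction.
        self.goal = self.generate_goal(instruction, self.obs)
        return self.obs

    def step(self, action: float) -> Tuple[Obs, float, bool]:
        # The action model predicts the next observation.
        self.obs = self.predict_next(self.obs, action)
        reward, done = self.reward_fn(self.obs, self.goal)
        return self.obs, reward, done


def make_toy_env() -> DualWorldModelEnv:
    """Builds the env with arithmetic stand-ins for both models (assumption)."""

    def predict_next(obs: Obs, action: float) -> Obs:
        return [x + action for x in obs]

    def generate_goal(instruction: str, obs: Obs) -> Obs:
        return [1.0] * len(obs)  # ignores the instruction in this toy

    def reward_fn(obs: Obs, goal: Obs) -> Tuple[float, bool]:
        dist = sum(abs(x - g) for x, g in zip(obs, goal)) / len(obs)
        return -dist, dist < 0.05  # 0.05 is an arbitrary toy threshold

    return DualWorldModelEnv(predict_next, generate_goal, reward_fn)


if __name__ == "__main__":
    env = make_toy_env()
    env.reset([0.0, 0.0], "reach the goal")
    print(env.step(0.5))
    print(env.step(0.5))
```

In this sketch the policy never touches a physics simulator: both the next observation and the reward come from model calls, which is what lets one interface cover multiple environments.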