RoboScape-R: Unified Reward-Observation World Models for Generalizable Robotics Training via RL

Yinzhou Tang; Yu Shang; Yinuo Chen; Bingwen Wei; Xin Zhang; Shu'ang Yu; Liangzhi Shi; Chao Yu; Chen Gao; Wei Wu; Yong Li

TL;DR

RoboScape-R demonstrates that a dual-world-model framework can serve as a unified, online RL environment for training generalizable embodied policies. It derives an endogenous general reward from the world model, combining LPIPS-based goal alignment with a done signal, which enables stable, cross-scenario policy optimization without handcrafted rewards. On ManiSkill, policies trained with the world-model reward show substantially better out-of-domain generalization (an average 37.5% improvement) and balanced multi-task performance, outperforming exogenous-reward baselines. The work highlights the potential of transformer-based world models as online training environments and points to future work on longer-horizon tasks and real-world applicability.
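To make the reward described in the TL;DR concrete, the following is a minimal sketch of how a goal-alignment term and a done signal might be combined into a single scalar. It is an illustration under stated assumptions, not the paper's formulation: the additive combination, the `done_threshold` and `done_bonus` parameters, and the function names are all hypothetical, and `mean_abs_distance` is a simple pixel-difference stand-in used only so the example runs without a learned perceptual metric such as LPIPS.

```python
from typing import Callable, Sequence

# Toy grayscale image: rows of pixel values in [0, 1] (assumption for illustration).
Image = Sequence[Sequence[float]]


def mean_abs_distance(a: Image, b: Image) -> float:
    """Stand-in for a perceptual distance. This is NOT LPIPS; it is a
    plain mean absolute pixel difference so the sketch has no dependencies."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count


def endogenous_reward(
    pred_obs: Image,
    goal_obs: Image,
    done_prob: float,
    distance: Callable[[Image, Image], float] = mean_abs_distance,
    done_threshold: float = 0.5,  # hypothetical cutoff on the done signal
    done_bonus: float = 1.0,      # hypothetical bonus when the task looks finished
) -> float:
    """Combine goal alignment with a done signal (assumed additive form)."""
    # Smaller distance to the goal image means higher (less negative) reward.
    alignment = -distance(pred_obs, goal_obs)
    # Add a bonus only when the predicted done probability crosses the cutoff.
    bonus = done_bonus if done_prob >= done_threshold else 0.0
    return alignment + bonus


goal = [[1.0, 0.0], [0.0, 1.0]]
far = [[0.0, 0.0], [0.0, 0.0]]
reward_far = endogenous_reward(far, goal, done_prob=0.1)
reward_at_goal = endogenous_reward(goal, goal, done_prob=0.9)
```

In this sketch an observation far from the goal receives a negative reward, while an observation that matches the goal and is flagged as done receives the full bonus. Because `distance` is a parameter, a real perceptual metric could be passed in its place.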

Abstract

Achieving generalizable embodied policies remains a key challenge. Traditional policy learning paradigms, including both Imitation Learning (IL) and Reinforcement Learning (RL), struggle to cultivate generalizability across diverse scenarios. While IL policies often overfit to specific expert trajectories, RL suffers from the inherent lack of a unified and general reward signal necessary for effective multi-scene generalization. We posit that the world model is uniquely capable of serving as a universal environment proxy to address this limitation. However, current world models primarily focus on their ability to predict observations and still rely on task-specific, handcrafted reward functions, thereby failing to provide a truly general training environment. To address this problem, we propose RoboScape-R, a framework leveraging the world model to serve as a versatile, general-purpose proxy for the embodied environment within the RL paradigm. We introduce a novel world model-based general reward mechanism that generates "endogenous" rewards derived from the model's intrinsic understanding of real-world state transition dynamics. Extensive experiments demonstrate that RoboScape-R effectively addresses the limitations of traditional RL methods by providing an efficient and general training environment that substantially enhances the generalization capability of embodied policies. Our approach offers critical insights into utilizing the world model as an online training strategy and achieves an average 37.5% performance improvement over baselines in out-of-domain scenarios.
