WHALE: Towards Generalizable and Scalable World Models for Embodied Decision-making

Zhilong Zhang, Ruifeng Chen, Junyin Ye, Yihao Sun, Pengyuan Wang, Jingcheng Pang, Kaiyuan Li, Tianshuo Liu, Haoxin Lin, Yang Yu, Zhi-Hua Zhou

TL;DR

This paper introduces WHALE, a framework for learning generalizable world models, consisting of two key techniques: behavior-conditioning and retracing-rollout, and proposes Whale-ST, a scalable spatial-temporal transformer-based world model with enhanced generalizability.

Abstract

World models play a crucial role in decision-making within embodied environments, enabling cost-free explorations that would otherwise be expensive in the real world. To facilitate effective decision-making, world models must be equipped with strong generalizability to support faithful imagination in out-of-distribution (OOD) regions and provide reliable uncertainty estimation to assess the credibility of the simulated experiences, both of which present significant challenges for prior scalable approaches. This paper introduces WHALE, a framework for learning generalizable world models, consisting of two key techniques: behavior-conditioning and retracing-rollout. Behavior-conditioning addresses the policy distribution shift, one of the primary sources of the world model generalization error, while retracing-rollout enables efficient uncertainty estimation without the necessity of model ensembles. These techniques are universal and can be combined with any neural network architecture for world model learning. Incorporating these two techniques, we present Whale-ST, a scalable spatial-temporal transformer-based world model with enhanced generalizability. We demonstrate the superiority of Whale-ST in simulation tasks by evaluating both value estimation accuracy and video generation fidelity. Additionally, we examine the effectiveness of our uncertainty estimation technique, which enhances model-based policy optimization in fully offline scenarios. Furthermore, we propose Whale-X, a 414M parameter world model trained on 970K trajectories from Open X-Embodiment datasets. We show that Whale-X exhibits promising scalability and strong generalizability in real-world manipulation scenarios using minimal demonstrations.

Paper Structure

This paper contains 57 sections, 2 theorems, 15 equations, 33 figures, 12 tables.

Key Result

Proposition A.2

Under Assumption app:assump, for any policy $\pi$, the value gap of a common dynamics model $T$ without behavior-conditioning has an upper bound expressed in terms of $W_1(d^\pi,d^\Pi)$, the Wasserstein-1 distance between the $\pi$-induced trajectory distribution $d^\pi(\tau)$ and the behavior trajectory distribution $d^\Pi(\tau)=\mathbb E_{\mu\sim\Pi}[d^\mu(\tau)]$.
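
To give some intuition for the quantity $W_1(d^\pi,d^\Pi)$, the following is a minimal toy sketch and not the paper's procedure. It assumes each trajectory can be summarized by a single scalar feature (a strong simplification; real trajectory distributions are high-dimensional), and it treats $\Pi$ as a uniform mixture over two hypothetical behavior policies `mu_1` and `mu_2`, so that pooling equally many samples from each approximates $d^\Pi$. For equal-size one-dimensional samples, the Wasserstein-1 distance reduces to the mean absolute difference between sorted values. All names and numbers below are illustrative assumptions.

```python
import random


def wasserstein_1d(xs, ys):
    """W1 between two equal-size 1-D empirical distributions.

    With uniform weights and equal sample counts, the optimal coupling
    pairs the sorted samples, so W1 is the mean absolute gap between them.
    """
    if len(xs) != len(ys):
        raise ValueError("this toy version needs equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


rng = random.Random(0)

# Hypothetical scalar "trajectory features" for two behavior policies.
behavior_samples = {
    "mu_1": [rng.gauss(0.0, 1.0) for _ in range(500)],
    "mu_2": [rng.gauss(1.0, 1.0) for _ in range(500)],
}

# d^Pi as a uniform mixture over behavior policies: pool equal-size samples.
d_Pi = behavior_samples["mu_1"] + behavior_samples["mu_2"]

# Two hypothetical target policies: one close to the data, one far from it.
near_policy = [rng.gauss(0.5, 1.0) for _ in range(1000)]
far_policy = [rng.gauss(4.0, 1.0) for _ in range(1000)]

w_near = wasserstein_1d(near_policy, d_Pi)
w_far = wasserstein_1d(far_policy, d_Pi)
print(f"W1(near, d_Pi) = {w_near:.3f}")
print(f"W1(far,  d_Pi) = {w_far:.3f}")
```

In this toy setting the policy whose trajectories resemble the behavior data yields a much smaller distance than the one that drifts away from it, which is the sense in which the bound ties the value gap to policy distribution shift.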

Figures (33)

  • Figure 1: Qualitative evaluation on Meta-World, Open X-Embodiment, and our real-world tasks.
  • Figure 2: Illustration of the retracing-rollout uncertainty quantifier.
  • Figure 3: Overall architecture of Whale-ST. The behavior-conditioning model encodes the observation and action subsequences into behavior embeddings $z_i$, which are then passed to the dynamics model along with observation tokens and actions to generate the next-token predictions $\hat{x}_{i+1}$. The predicted observation tokens are then fed back into the dynamics model autoregressively for further predictions, and are also decoded into observation predictions that are used to obtain later behavior embeddings.
  • Figure 4: Offline reinforcement learning with different uncertainty estimation methods.
  • Figure 5: Physical robot evaluation on unseen scenarios. The top row shows bar charts of the consistency rate, and the bottom row shows the tasks used for testing. The experiments demonstrate that Whale-X generalizes well to unseen scenarios.
  • ...and 28 more figures
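
The autoregressive loop described in the Figure 3 caption can be sketched as follows. This is only a schematic under heavy assumptions: the functions `encode_behavior`, `tokenize`, `dynamics`, and `decode` are hypothetical scalar stand-ins invented for illustration, not the Whale-ST networks, and observations are plain floats rather than images. The sketch shows only the data flow: history to behavior embedding, embedding plus tokens plus action to next tokens, and decoded predictions fed back into the history.

```python
def encode_behavior(obs_history, act_history):
    """Stand-in for the behavior-conditioning model: maps the observation
    and action subsequences to an embedding z_i (a single float here)."""
    pairs = list(zip(obs_history, act_history))
    return sum(o + a for o, a in pairs) / max(len(pairs), 1)


def tokenize(obs):
    """Stand-in tokenizer: one 'token' per observation."""
    return [obs]


def decode(tokens):
    """Stand-in decoder: inverse of the toy tokenizer."""
    return tokens[0]


def dynamics(tokens, action, z):
    """Stand-in dynamics model: predicts next tokens from the current
    tokens, the action, and the behavior embedding."""
    return [tokens[0] + action + 0.1 * z]


def rollout(initial_obs, actions):
    """Autoregressive imagination over a fixed action sequence."""
    obs_history, act_history = [initial_obs], []
    tokens = tokenize(initial_obs)
    predictions = []
    for action in actions:
        # Behavior embedding from the history seen (or imagined) so far.
        z = encode_behavior(obs_history, act_history)
        # Next-token prediction conditioned on tokens, action, and z.
        tokens = dynamics(tokens, action, z)
        # Decode so that later behavior embeddings can use the prediction.
        predicted_obs = decode(tokens)
        predictions.append(predicted_obs)
        obs_history.append(predicted_obs)
        act_history.append(action)
    return predictions


imagined = rollout(0.0, [1.0, 1.0])
print(imagined)
```

Note that after the first step the behavior embedding is computed from imagined observations rather than real ones, which mirrors the caption's point that decoded predictions are used to obtain later embeddings.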

Theorems & Definitions (2)

  • Proposition A.2
  • Proposition A.3