WristWorld: Generating Wrist-Views via 4D World Models for Robotic Manipulation

Zezhong Qian, Xiaowei Chi, Yuming Li, Shizun Wang, Zhiyuan Qin, Xiaozhu Ju, Sirui Han, Shanghang Zhang

TL;DR

WristWorld tackles the scarcity of wrist-view data in robotic manipulation by proposing a two-stage 4D Generative World Model that converts anchor-view observations into wrist-view videos. The Reconstruction stage extends VGGT with a WristHead to estimate wrist poses and 4D geometry under a Spatial Projection Consistency loss, while the Generation stage uses a diffusion-based generator conditioned on these projections and CLIP-encoded semantics to synthesize temporally coherent wrist views. Experiments on Droid, Calvin, and Franka Panda demonstrate state-of-the-art wrist-view generation with superior spatial/temporal fidelity and yield tangible VLA gains (e.g., Calvin Avg Len +3.81% and 42.4% anchor–wrist gap reduction). WristWorld also plugs into single-view world models to enrich perception and control without additional wrist data, offering a scalable path to multi-view robotics training.

Abstract

Wrist-view observations are crucial for VLA models as they capture fine-grained hand-object interactions that directly enhance manipulation performance. Yet large-scale datasets rarely include such recordings, resulting in a substantial gap between abundant anchor views and scarce wrist views. Existing world models cannot bridge this gap, as they require a wrist-view first frame and thus fail to generate wrist-view videos from anchor views alone. Amid this gap, recent visual geometry models such as VGGT emerge with geometric and cross-view priors that make it possible to address extreme viewpoint shifts. Inspired by these insights, we propose WristWorld, the first 4D world model that generates wrist-view videos solely from anchor views. WristWorld operates in two stages: (i) Reconstruction, which extends VGGT and incorporates our Spatial Projection Consistency (SPC) Loss to estimate geometrically consistent wrist-view poses and 4D point clouds; (ii) Generation, which employs our video generation model to synthesize temporally coherent wrist-view videos from the reconstructed perspective. Experiments on Droid, Calvin, and Franka Panda demonstrate state-of-the-art video generation with superior spatial consistency, while also improving VLA performance, raising the average task completion length on Calvin by 3.81% and closing 42.4% of the anchor-wrist view gap.

Paper Structure

This paper contains 30 sections, 12 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: We present WristWorld, a framework that synthesizes realistic wrist-view videos from anchor views through a two-stage process: a reconstruction stage for estimating wrist-view projections, and a generation stage for producing coherent wrist-view videos. The generated wrist observations effectively expand the training data to novel views and lead to significant performance improvements for downstream VLA models across various tasks.
  • Figure 2: Overview of our method. We introduce a two-stage 4D Generative World Model. In the reconstruction stage, VGGT [wang2025vggt] is extended with a wrist head to regress wrist pose, guided by a Spatial Projection Consistency Loss that supervises directly from RGB without depth or extrinsics. The predicted pose projects point clouds into the wrist view. In the generation stage, these projections, combined with external-view CLIP embeddings, condition a video generator to synthesize wrist-view sequences. Without first-frame guidance, the model produces additional wrist views for VLA datasets, yielding substantial performance gains.
  • Figure 3: Spatial Projection Consistency (SPC) loss. We first establish anchor–wrist 2D point matching and then lift the matched pixels to 2D–3D correspondences using the reconstructed point cloud. The 3D points are subsequently projected into the wrist view with the predicted wrist pose, after which the SPC loss is computed to enforce geometric consistency.
  • Figure 4: Visualization of our generation results. We compare our generated condition maps against the 3D Base (VGGT without the SPC Loss); our approach demonstrates superior viewpoint consistency. Furthermore, compared with the WoW 14B [chi2025wow] baseline, which is based on Wan 14B [wan2025wanopenadvancedlargescale], our method achieves both higher generation quality and more accurate viewpoint alignment. These results highlight the effectiveness of our framework and underscore the potential of its outputs to serve as training data for downstream VLA models.
  • Figure 5: Visualization on the Calvin [mees2022calvin] benchmark. We compare our generated wrist-view videos (bottom row) with the ground truth (second row) and a baseline method (third row, Stable Video Diffusion [blattmann2023stablevideodiffusionscaling]). Our approach achieves better spatial and viewpoint consistency than the baseline, while also producing more faithful wrist-view frames. These results highlight the effectiveness of our method in bridging anchor-view and wrist-view perspectives.
  • ...and 3 more figures
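
The projection step in the Figure 3 caption (3D points projected into the wrist view with the predicted wrist pose, then compared against matched wrist pixels) can be illustrated with a plain reprojection error. This is a minimal NumPy sketch under stated assumptions, not the authors' SPC loss: it assumes a pinhole camera with known intrinsics `K`, a world-to-camera pose `(R, t)`, and precomputed 2D-3D correspondences. The function names `project_points` and `reprojection_loss` are hypothetical, and the paper's actual loss may include terms not shown here.

```python
import numpy as np


def project_points(points_world, R, t, K):
    """Project (N, 3) world points into a pinhole camera.

    R (3, 3) and t (3,) are an assumed world-to-camera pose; K (3, 3) is an
    assumed intrinsics matrix. Returns pixel coordinates (N, 2) and depths (N,).
    """
    cam = points_world @ R.T + t            # world frame -> camera frame
    z = cam[:, 2]
    safe_z = np.where(np.abs(z) < 1e-8, 1e-8, z)  # avoid division by zero
    pix = cam @ K.T                         # apply intrinsics
    uv = pix[:, :2] / safe_z[:, None]       # perspective divide
    return uv, z


def reprojection_loss(points_world, wrist_uv, R, t, K):
    """Mean pixel distance between projected 3D points and matched wrist pixels.

    Points that land behind the camera (depth <= 0) are ignored in this sketch.
    """
    uv_hat, z = project_points(points_world, R, t, K)
    valid = z > 0
    if not np.any(valid):
        return 0.0
    err = np.linalg.norm(uv_hat[valid] - wrist_uv[valid], axis=1)
    return float(err.mean())


if __name__ == "__main__":
    K = np.array([[100.0, 0.0, 50.0],
                  [0.0, 100.0, 50.0],
                  [0.0, 0.0, 1.0]])
    pts = np.array([[0.0, 0.0, 1.0],
                    [0.1, 0.0, 1.0],
                    [0.0, 0.2, 2.0]])
    R_true, t_true = np.eye(3), np.zeros(3)
    wrist_uv, _ = project_points(pts, R_true, t_true, K)

    # A correct pose gives zero error; a shifted pose gives a positive error.
    print(reprojection_loss(pts, wrist_uv, R_true, t_true, K))
    print(reprojection_loss(pts, wrist_uv, R_true, np.array([0.1, 0.0, 0.0]), K))
```

In a training setting, the same quantity would be computed with a differentiable framework so that the gradient flows back into the predicted wrist pose; the NumPy version here only shows the geometry.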