Chain of World: World Model Thinking in Latent Motion

Fuxiang Yang, Donglin Di, Lulu Tang, Xuancheng Zhang, Lei Fan, Hao Li, Chen Wei, Tonghua Su, Baorui Ma

TL;DR

CoWVLA (Chain-of-World VLA) is a new paradigm that unifies world-model temporal reasoning with a disentangled latent motion representation. It outperforms existing world-model and latent-action approaches while achieving moderate computational efficiency, highlighting its potential as a more effective VLA pretraining paradigm.

Abstract

Vision-Language-Action (VLA) models are a promising path toward embodied intelligence, yet they often overlook the predictive and temporal-causal structure underlying visual dynamics. World-model VLAs address this by predicting future frames, but waste capacity reconstructing redundant backgrounds. Latent-action VLAs encode frame-to-frame transitions compactly, but lack temporally continuous dynamic modeling and world knowledge. To overcome these limitations, we introduce CoWVLA (Chain-of-World VLA), a new "Chain of World" paradigm that unifies world-model temporal reasoning with a disentangled latent motion representation. First, a pretrained video VAE serves as a latent motion extractor, explicitly factorizing video segments into structure and motion latents. Then, during pre-training, the VLA learns from an instruction and an initial frame to infer a continuous latent motion chain and predict the segment's terminal frame. Finally, during co-fine-tuning, this latent dynamic is aligned with discrete action prediction by jointly modeling sparse keyframes and action sequences in a unified autoregressive decoder. This design preserves the world-model benefits of temporal reasoning and world knowledge while retaining the compactness and interpretability of latent actions, enabling efficient visuomotor learning. Extensive experiments on robotic simulation benchmarks show that CoWVLA outperforms existing world-model and latent-action approaches and achieves moderate computational efficiency, highlighting its potential as a more effective VLA pretraining paradigm. The project website can be found at https://fx-hit.github.io/cowvla-io.

Paper Structure

This paper contains 26 sections, 3 equations, 13 figures, and 8 tables.

Figures (13)

  • Figure 1: Comparison of VLA pretraining strategies. (a) World Model: It predicts future visual frames, leading to redundant background reconstruction. (b) Latent Action: It learns the frame-to-frame transition using a visual encoder $E$, but lacks temporally continuous reasoning. (c) CoWVLA: Our method first uses a video encoder $E$ to decompose each video segment into motion and structure latents, and then trains the VLM to infer latent motion and predict the terminal frame of the segment given the instruction and the initial frame.
  • Figure 1: Sensitivity analysis of $N$ and $l_a$ on LIBERO.
  • Figure 2: Overview of the CoWVLA framework. CoWVLA consists of two core components: a latent motion extractor and a VLA decoder. The latent motion extractor, implemented as a video VAE, disentangles each video segment into a structure latent $z_s$ and two directional motion latents $z_m^h$ and $z_m^w$, which are concatenated into a unified latent motion vector $z_m$. The VLA decoder performs unified autoregressive modeling over multimodal sequences. During pre-training, the model takes the instruction and initial frame as input, and uses a learnable motion query $Q$ to predict the latent motion $\hat{z}_m$ while reconstructing the terminal frame of the video segment. During co-fine-tuning, the input expands into alternating keyframe–action pairs; $Q$ continues to aggregate temporally continuous latent dynamics, guiding multi-step action generation under sparse visual observations.
  • Figure 2: Cross-Recon visualization on LIBERO (Liu et al., 2023). The first six columns show temporally sampled frames from three rows: Structure (top), Motion (middle), and Cross-Recon (bottom). The Cross-Recon videos are generated by combining the static appearance from the Structure video with the motion representation extracted from the Motion video, revealing the transferred motion patterns. Each Cross-Recon frame is overlaid with a motion heatmap to highlight dynamic regions. The last column presents three summary maps: motion heatmaps obtained by averaging and maximizing per-frame absolute differences between Cross-Recon and Structure, and the end-effector trajectory estimated from the motion regions.
  • Figure 3: Visualization of the disentangled motion and structure latents. We select two frames ($t_1$ and $t_2$) and show the original (Orig.) and reconstructed (Recon.) frames. "M. Recon." and "S. Recon." denote the reconstructions obtained by decoding only the motion latent or only the structure latent, respectively. The structure latent preserves the global scene layout, whereas the motion latent captures motion and fine-grained temporal details.
  • ...and 8 more figures
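
The Cross-Recon caption above describes two summary heatmaps, obtained by averaging and by maximizing the per-frame absolute differences between the Cross-Recon and Structure videos. The following is a minimal, dependency-free sketch of that computation, not the authors' code. It assumes grayscale frames stored as nested lists of floats and time-aligned videos of equal length; the names `abs_diff` and `motion_summary_maps` are hypothetical. The end-effector trajectory estimate is not covered because the caption does not say how it is derived.

```python
from typing import List, Tuple

# Assumption: a frame is a grayscale image stored as rows of pixel intensities.
Frame = List[List[float]]


def abs_diff(a: Frame, b: Frame) -> Frame:
    """Per-pixel absolute difference between two equally sized frames."""
    return [
        [abs(x - y) for x, y in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]


def motion_summary_maps(
    cross_recon: List[Frame], structure: List[Frame]
) -> Tuple[Frame, Frame]:
    """Return (mean_map, max_map) of the per-frame absolute differences.

    Hypothetical helper: both videos are assumed to be time-aligned and to
    share the same spatial resolution.
    """
    if not cross_recon or len(cross_recon) != len(structure):
        raise ValueError("videos must be non-empty and equally long")

    diffs = [abs_diff(c, s) for c, s in zip(cross_recon, structure)]
    n = len(diffs)
    height, width = len(diffs[0]), len(diffs[0][0])

    # Average over time: highlights regions that move consistently.
    mean_map = [
        [sum(d[i][j] for d in diffs) / n for j in range(width)]
        for i in range(height)
    ]
    # Maximum over time: highlights any region that moved at least once.
    max_map = [
        [max(d[i][j] for d in diffs) for j in range(width)]
        for i in range(height)
    ]
    return mean_map, max_map


# Toy example: two 2x2 frames with a static (all-zero) Structure video.
structure = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
cross_recon = [[[1.0, 0.0], [0.0, 0.0]], [[3.0, 0.0], [0.0, 0.5]]]
mean_map, max_map = motion_summary_maps(cross_recon, structure)
```

In practice the same reduction would normally be written with an array library as a mean and a max over the time axis. Plain lists are used here only to keep the sketch self-contained.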