World2Act: Latent Action Post-Training via Skill-Compositional World Models

An Dinh Vuong, Tuan Van Vo, Abdullah Sohail, Haoran Ding, Liang Ma, Xiaodan Liang, Anqing Duan, Ivan Laptev, Ian Reid

TL;DR

The paper proposes an automatic LLM-based skill-decomposition pipeline that segments high-level instructions into low-level prompts. The pipeline supports skill-compositional WMs that remain temporally consistent across diverse task horizons, improving embodied-agent generalization.

Abstract

World Models (WMs) have emerged as a promising approach for post-training Vision-Language-Action (VLA) policies to improve robustness and generalization under environmental changes. However, most WM-based post-training methods rely on pixel-space supervision, making policies sensitive to pixel-level artifacts and hallucination from imperfect WM rollouts. We introduce World2Act, a post-training framework that aligns VLA actions directly with WM video-dynamics latents using a contrastive matching objective, reducing dependence on pixels. Post-training performance is tied to rollout quality, yet current WMs struggle with arbitrary-length video generation as they are mostly trained on fixed-length clips while robotic execution durations vary widely. To address this, we propose an automatic LLM-based skill-decomposition pipeline that segments high-level instructions into low-level prompts. Our pipeline produces RoboCasa-Skill and LIBERO-Skill, supporting skill-compositional WMs that remain temporally consistent across diverse task horizons. Empirically, applying World2Act to VLAs like GR00T-N1.6 and Cosmos Policy achieves state-of-the-art results on RoboCasa and LIBERO, and improves real-world performance by 6.7%, enhancing embodied agent generalization.
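
The abstract does not spell out the contrastive matching objective, so the following is only a minimal sketch of one common choice, a symmetric InfoNCE loss over paired action and video latents. The function names, the temperature value, and the use of cosine similarity as the score are assumptions for illustration, not the authors' stated formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(action_latents, video_latents, temperature=0.1):
    """Symmetric InfoNCE over a batch of paired latents (hypothetical helper).

    Row i of each list is assumed to come from the same trajectory;
    every other row in the batch acts as a negative.
    """
    logits = [
        [cosine(a, v) / temperature for v in video_latents]
        for a in action_latents
    ]

    def cross_entropy_on_diagonal(rows):
        # Mean of -log softmax(row)[i], computed with a stable log-sum-exp.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(c) for c in zip(*logits)]  # video-to-action direction
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(columns))


# Matched pairs should score a much lower loss than mismatched ones.
actions = [[1.0, 0.0], [0.0, 1.0]]
videos_matched = [[1.0, 0.0], [0.0, 1.0]]
videos_swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = info_nce(actions, videos_matched)
loss_swapped = info_nce(actions, videos_swapped)
```

Under this sketch, minimizing the loss pulls each action latent toward the video latent of its own trajectory and away from the others in the batch, which is the sense in which supervision happens in latent space rather than in pixel space.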

Paper Structure (22 sections, 1 equation, 13 figures, 10 tables)

This paper contains 22 sections, 1 equation, 13 figures, 10 tables.

Figures (13)

  • Figure 1: Prior WM$\rightarrow$VLA post-training supervises actions in pixel space, making policies sensitive to rollout artifacts. We instead transfer dynamics priors by aligning VLA action representations to WM video-dynamics latents, reducing dependence on pixels.
  • Figure 2: Skill-compositional video generation. (a) Data processing. We segment each demonstration by gripper-state changes and decompose the instruction into atomic skill prompts with an LLM. (b) Inference pipeline. The LLM first generates an ordered list of atomic prompts, which the finetuned WM uses to generate one sub-video per skill. Sub-videos are concatenated to obtain the final video.
  • Figure 3: Video-length distribution. Across RoboCasa and LIBERO, decomposed skills exhibit shorter mean length and a more concentrated distribution than full sequences, improving stability for generating arbitrary-length videos.
  • Figure 4: World2Act overview. (a) Stage 1: Latent alignment. We train video and action adapters ($\mathcal{B}_\text{v},\mathcal{B}_\text{a}$) with reconstruction and contrastive objectives. (b) Stage 2: VLA post-training. We freeze the VLA and learn a residual policy guided by WM-induced latent dynamics. (c) Network architecture of the residual policy.
  • Figure 5: Post-training scaling and generalization. Top: Cosine similarity strongly correlates with downstream success rate. Bottom-left: Scaling post-training trajectories improves World2Act, whereas DreamGen is unstable. Bottom-right: Scaling the number of seen tasks during training improves generalization to 12 unseen tasks.
  • ...and 8 more figures
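
The data-processing and inference steps described in the Figure 2 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the gripper signal is assumed to be one boolean per frame, a segment boundary is assumed to fall exactly on the frame where the state flips, and `generate_clip` is a hypothetical stand-in for the finetuned WM.

```python
def segment_by_gripper(gripper_open):
    """Split a demonstration into segments at each gripper-state change.

    gripper_open: list of booleans, one per frame (assumed input format).
    Returns a list of (start, end) frame-index pairs, end exclusive.
    """
    if not gripper_open:
        return []
    segments = []
    start = 0
    for t in range(1, len(gripper_open)):
        if gripper_open[t] != gripper_open[t - 1]:
            segments.append((start, t))  # close the segment before the flip
            start = t
    segments.append((start, len(gripper_open)))
    return segments


def compose_video(prompts, generate_clip):
    """Generate one sub-video per atomic prompt and concatenate the frames.

    generate_clip is a hypothetical callable mapping a prompt to a list
    of frames; it stands in for the world model.
    """
    frames = []
    for prompt in prompts:
        frames.extend(generate_clip(prompt))
    return frames


# Toy usage: a gripper that is open for 2 frames, closed for 3, then open again.
segments = segment_by_gripper([True, True, False, False, False, True])

# Toy "world model" that returns two labelled frames per prompt.
video = compose_video(["reach", "grasp"], lambda p: [p + "_0", p + "_1"])
```

Because each sub-video covers a single short skill, its length stays close to the clip lengths the WM was trained on, which is the motivation given in the Figure 3 caption.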