Co-Evolving Latent Action World Models

Yucen Wang, Fengming Zhang, De-Chuan Zhan, Li Zhao, Kaixin Wang, Jiang Bian

TL;DR

CoLA-World tackles the challenge of jointly learning latent actions with a pre-trained diffusion-based world model by introducing a warm-up alignment phase that prevents representational collapse. The world model acts as a tutor, providing gradients that shape a high-quality latent-action space $z_t$, while the latent actions offer a precise control interface that improves the world model's prediction of $o_{t+1}$ from $o_t$. Compared with two-stage pipelines, CoLA-World matches or exceeds them in video simulation quality and downstream visual planning, with better sample efficiency and robustness to real-action adapters. This co-evolutionary paradigm provides a scalable path toward generalist latent-action-based world models, with potential for broader vision-language-latent-action extensions.

Abstract

Adapting pre-trained video generation models into controllable world models via latent actions is a promising step towards creating generalist world models. The dominant paradigm adopts a two-stage approach that trains the latent action model (LAM) and the world model separately, resulting in redundant training and limiting their potential for co-adaptation. A conceptually simple and appealing idea is to directly replace the forward dynamics model in the LAM with a powerful world model and train the two jointly, but doing so is non-trivial and prone to representational collapse. In this work, we propose CoLA-World, which for the first time successfully realizes this synergistic paradigm, resolving the core challenge in joint learning through a critical warm-up phase that effectively aligns the representations of the from-scratch LAM with the pre-trained world model. This unlocks a co-evolution cycle: the world model acts as a knowledgeable tutor, providing gradients to shape a high-quality LAM, while the LAM offers a more precise and adaptable control interface to the world model. Empirically, CoLA-World matches or outperforms prior two-stage methods in both video simulation quality and downstream visual planning, establishing a robust and efficient new paradigm for the field.
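
The warm-up-then-joint schedule described in the abstract can be illustrated with a minimal sketch. The function name `trainable_modules`, the module labels, and the `warmup_steps` parameter are hypothetical. The sketch also assumes that only the LAM is updated during warm-up while the pre-trained world model stays frozen; the abstract does not state this detail, so treat it as one plausible reading rather than the paper's procedure.

```python
def trainable_modules(step: int, warmup_steps: int) -> set:
    """Return the names of the modules updated at training step `step`.

    Assumption (not stated in the abstract): during warm-up only the
    latent action model is updated and the world model is frozen;
    afterwards both modules are trained jointly.
    """
    if step < 0 or warmup_steps < 0:
        raise ValueError("step and warmup_steps must be non-negative")
    if step < warmup_steps:
        # Warm-up phase: align the from-scratch LAM with the world model.
        return {"latent_action_model"}
    # Joint phase: both modules receive gradient updates.
    return {"latent_action_model", "world_model"}
```

A training loop could call `trainable_modules(step, warmup_steps)` once per step and enable gradients only for the returned modules; the length of the warm-up is a hyperparameter that this summary does not specify.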

Paper Structure

This paper contains 25 sections, 1 equation, 6 figures, 4 tables.

Figures (6)

  • Figure 1: (a) Prior works use a two-stage pipeline: learn a latent action model (LAM), then fix it to train the world model. (b) We propose a one-stage pipeline, directly using the world model as the forward dynamics model and backpropagating gradients through latent actions.
  • Figure 2: Latent action codebook metrics during joint training of the IDM and world model. "rand" indicates random initialization, while "pre" indicates initialization from pre-trained weights. The dashed line shows the codebook metrics of the pre-trained IDM. All three subplots share the same legend, shown only in the middle panel for clarity.
  • Figure 3: Latent action codebook metrics during warm-up and joint training. Different blue curves correspond to IDM initializations from warm-up checkpoints at various steps. All three subplots share the same legend, shown only in the middle panel for clarity.
  • Figure 4: Evidence of synergistic co-evolution. The LAM's probing loss drops faster when the world model is co-evolving (a), while the world model achieves higher video prediction performance as the LAM improves (b).
  • Figure 5: Codebook metrics in different training and adaptation stages. All subplots share the same legend, shown only in the middle panel for clarity.
  • ...and 1 more figure
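
Several of the captions above refer to latent action codebook metrics without naming them. As an illustration only, the sketch below computes two metrics commonly used to detect collapse in a discrete codebook: utilization (the fraction of codes selected at least once) and perplexity (the exponential of the entropy of the code-usage distribution). The choice of these two metrics and the function name `codebook_metrics` are assumptions; the paper may report different quantities.

```python
import math
from collections import Counter


def codebook_metrics(indices, codebook_size):
    """Return (utilization, perplexity) for a batch of selected code indices.

    Illustrative assumption: `indices` holds the integer code chosen for
    each latent action, with values in the range [0, codebook_size).
    """
    if codebook_size <= 0:
        raise ValueError("codebook_size must be positive")
    if not indices:
        raise ValueError("indices must be non-empty")
    if any(i < 0 or i >= codebook_size for i in indices):
        raise ValueError("index outside the codebook range")

    counts = Counter(indices)
    total = len(indices)
    # Fraction of codebook entries that were selected at least once.
    utilization = len(counts) / codebook_size
    # Entropy (in nats) of the empirical code-usage distribution.
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    # Perplexity: the effective number of codes in use.
    perplexity = math.exp(entropy)
    return utilization, perplexity
```

Under these definitions, a collapsed codebook in which every latent action maps to the same code has a perplexity of 1, while uniform use of all `K` codes gives a utilization of 1 and a perplexity of `K`.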