StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation

Mingyu Liu, Jiuhe Shu, Hui Chen, Zeju Li, Canyu Zhao, Jiange Yang, Shenyuan Gao, Hao Chen, Chunhua Shen

TL;DR

StaMo tackles the challenge of learning expressive yet compact state representations for robotic world modeling by encoding each static image observation into two 1024-dimensional tokens, using a Diffusion Autoencoder with a frozen DINOv2 encoder and a DiT decoder. Motion emerges as the difference between the tokens of consecutive states, so simple linear interpolation in the token space yields smooth, executable latent motions without relying on video-based action learning. The approach integrates with Vision-Language-Action (VLA) models to improve world modeling and supports latent-motion co-training, improving performance by 14.3% on LIBERO and 30% in real-world task success with minimal inference overhead. The results show generalization and sim-to-real transfer across diverse data sources, pointing toward scalable unsupervised skill discovery that bridges perception and action in robotic systems.

Abstract

A fundamental challenge in embodied intelligence is developing expressive and compact state representations for efficient world modeling and decision making. However, existing methods often fail to achieve this balance, yielding representations that are either overly redundant or lacking in task-critical information. We propose an unsupervised approach that learns a highly compressed two-token state representation using a lightweight encoder and a pre-trained Diffusion Transformer (DiT) decoder, capitalizing on its strong generative prior. Our representation is efficient, interpretable, and integrates seamlessly into existing VLA-based models, improving performance by 14.3% on LIBERO and 30% in real-world task success with minimal inference overhead. More importantly, we find that the difference between these tokens, obtained via latent interpolation, naturally serves as a highly effective latent action, which can be further decoded into executable robot actions. This emergent capability reveals that our representation captures structured dynamics without explicit supervision. We name our method StaMo for its ability to learn generalizable robotic Motion from a compact State representation encoded from static images, challenging the prevalent reliance of latent action learning on complex architectures and video data. The resulting latent actions also enhance policy co-training, outperforming prior methods by 10.4% with improved interpretability. Moreover, our approach scales effectively across diverse data sources, including real-world robot data, simulation, and human egocentric video.
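
To make the idea of motion as a token difference concrete, the following is a minimal sketch, not the authors' implementation. It assumes a state is simply a short list of token vectors, that the latent action is the element-wise difference between the states of two frames, and that intermediate states lie on the straight line between them. The names `latent_motion` and `interpolate` are hypothetical, and the encoder and DiT decoder are omitted entirely; the toy tokens have 3 dimensions rather than 1024.

```python
from typing import List

Token = List[float]   # one latent token (1024-d in the paper; tiny here)
State = List[Token]   # a state is a short list of tokens (two in the paper)


def latent_motion(state_a: State, state_b: State) -> State:
    """Element-wise difference state_b - state_a, read here as a latent action."""
    return [[b - a for a, b in zip(tok_a, tok_b)]
            for tok_a, tok_b in zip(state_a, state_b)]


def interpolate(state_a: State, state_b: State, steps: int) -> List[State]:
    """Return steps + 1 states evenly spaced on the line from state_a to state_b."""
    if steps < 1:
        raise ValueError("steps must be >= 1")
    delta = latent_motion(state_a, state_b)
    path = []
    for k in range(steps + 1):
        alpha = k / steps  # 0.0 at state_a, 1.0 at state_b
        path.append([[a + alpha * d for a, d in zip(tok_a, tok_d)]
                     for tok_a, tok_d in zip(state_a, delta)])
    return path


# Toy stand-ins for the encoded states of two frames (assumed values).
s_t = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]
s_next = [[2.0, 1.0, 0.0], [1.0, 3.0, 5.0]]

motion = latent_motion(s_t, s_next)
path = interpolate(s_t, s_next, steps=4)
```

In this reading, each entry of `path` would be passed to a decoder to render an intermediate frame, and `motion` would be the input to whatever head maps latent actions to robot actions; both of those components are outside the scope of the sketch.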

Paper Structure

This paper contains 21 sections, 4 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: An overview of our StaMo framework. Our method efficiently compresses and encodes robotic visual representations, enabling the learning of a compact state representation. Motion naturally emerges as the difference between these states in the highly compressed token space. This approach facilitates efficient world modeling and demonstrates strong generalization, with the potential to scale up with more data.
  • Figure 2: Where is StaMo? This figure visualizes how different robotic representations fall on the spectrum of expressiveness versus compactness. StaMo uniquely occupies the ideal position, offering both a rich, expressive state representation and the ability to model motion from a highly compact space.
  • Figure 3: Images reconstructed by StaMo from as few as two 1024-dimensional tokens. The first row shows the ground truth, and the second row shows the reconstructions, with the corresponding PSNR and SSIM metrics listed below. The results demonstrate that StaMo can preserve high image fidelity and structural similarity even under extremely compressed state representations.
  • Figure 4: Linear Probing MSE results. We compare our method against three baselines. Our method consistently achieves the lowest MSE across all horizons.
  • Figure 5: Scaling Performance. The performance of our model scales steadily with more data, including human egocentric data.
  • ...and 4 more figures