S-VAM: Shortcut Video-Action Model by Self-Distilling Geometric and Semantic Foresight

Haodong Yan; Zhide Zhong; Jiaguan Zhu; Junjie He; Weilin Yuan; Wenxuan Song; Xin Gong; Yingjie Cai; Guanyi Zhao; Xu Yan; Bingbing Liu; Ying-Cong Chen; Haoang Li

Abstract

Video-action models (VAMs) have emerged as a promising paradigm for robot learning, owing to their powerful visual foresight for complex manipulation tasks. However, current VAMs typically rely on either slow multi-step video generation or noisy one-step feature extraction, and so cannot simultaneously guarantee real-time inference and high-fidelity foresight. To address this limitation, we propose S-VAM, a shortcut video-action model that foresees coherent geometric and semantic representations via a single forward pass. Serving as a stable blueprint, these foreseen representations significantly simplify action prediction. To enable this efficient shortcut, we introduce a novel self-distillation strategy that condenses the structured generative priors of multi-step denoising into one-step inference. Specifically, vision foundation model (VFM) representations extracted from the diffusion model's own multi-step generated videos provide teacher targets. Lightweight decouplers, as students, learn to directly map noisy one-step features to these targets. Extensive experiments in simulation and the real world demonstrate that S-VAM outperforms state-of-the-art methods, enabling efficient and precise manipulation in complex environments. Our project page is https://haodong-yan.github.io/S-VAM/
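
The teacher-student setup described in the abstract can be illustrated with a small toy sketch. Everything here is an assumption for illustration and is not taken from the paper: the `LinearDecoupler` class, the plain MSE regression loss, the use of one linear map per student, and the fixed toy vectors that stand in for one-step diffusion features and for VFM features of multi-step generated videos. The paper's actual decoupler architecture and loss are not given in this summary.

```python
import random


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


class LinearDecoupler:
    """Hypothetical student: one linear map from one-step features to a teacher target."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                  for _ in range(out_dim)]

    def forward(self, x):
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in self.w]

    def sgd_step(self, x, target, lr):
        """One gradient step on the MSE loss; returns the loss before the update."""
        pred = self.forward(x)
        n = len(pred)
        for j, row in enumerate(self.w):
            grad_out = 2.0 * (pred[j] - target[j]) / n
            for i in range(len(row)):
                row[i] -= lr * grad_out * x[i]
        return mse(pred, target)


def distill_step(one_step_feat, geo_target, sem_target, geo_dec, sem_dec, lr=0.1):
    """Both students regress their (frozen) teacher targets from the same input."""
    loss_geo = geo_dec.sgd_step(one_step_feat, geo_target, lr)
    loss_sem = sem_dec.sgd_step(one_step_feat, sem_target, lr)
    return loss_geo + loss_sem


# Toy stand-ins (assumptions): a noisy one-step feature and two teacher targets
# that would, in the described method, come from VFMs applied to multi-step videos.
feat = [1.0, 0.5, -0.5, 0.25]
geo_target = [0.2, -0.1]
sem_target = [0.3, 0.0, -0.4]

geo_dec = LinearDecoupler(in_dim=4, out_dim=2, seed=0)
sem_dec = LinearDecoupler(in_dim=4, out_dim=3, seed=1)

losses = [distill_step(feat, geo_target, sem_target, geo_dec, sem_dec)
          for _ in range(200)]
```

In this sketch the teacher targets are only needed while training; at inference, a single call to each student's `forward` on the one-step feature replaces the multi-step generation path, which is the shortcut the abstract describes.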

Paper Structure (16 sections, 9 equations, 5 figures, 4 tables)

Introduction
Related Works
Vision-Language-Action Models
Video Generation Models for Robot Learning
Vision Foundation Models in Robot Learning
Method
Preliminaries
Geometric and Semantic Foresight Distillation
Action Expert
Experiment
Experimental Setup
Evaluation Results on Simulated Benchmarks (Q1)
Ablation Study (Q2)
Alternative Vision Foundation Representations (Q3)
Multi-task Experiments in the Real World (Q4)
...and 1 more sections

Figures (5)

Figure 1: Motivation and overview of our Shortcut Video-Action Model. (a) Current video-action models struggle with a trade-off: one-step feature extraction is fast but yields noisy and entangled representations, whereas multi-step video generation predicts precise future states but is too slow for real-time control. (b) To address this, we propose a shortcut video-action model that foresees coherent geometric and semantic representations via a single forward pass. Specifically, we introduce a self-distillation strategy that extracts vision foundation model (VFM) representations (DPAv3 [lin2025depth] and DINOv2 [oquab2023dinov2]) from the diffusion model's own multi-step generated videos to serve as teacher supervision exclusively during training (dashed path). By employing lightweight decouplers as students to map entangled one-step features to these geometry- and semantics-oriented targets, our approach condenses structured generative priors of multi-step denoising into one-step inference, thereby enabling real-time and precise action prediction.
Figure 2: Architecture of our S-VAM. The core technical novelty lies in establishing a shortcut that bypasses the prohibitive latency of iterative video generation. Specifically, specialized decouplers disentangle highly entangled one-step diffusion features into coherent geometric and semantic foresight. This foresight is then aggregated with the original features by a Uni-Perceiver, providing a holistic conditioning context for the downstream diffusion policy to predict precise robot actions.
Figure 3: Qualitative comparison on CALVIN [mees2022calvin]. VPP [huvideo] utilizes entangled one-step features, resulting in an erratic attention trajectory that explicitly contradicts the language instruction and leads to failed actuation. In contrast, our S-VAM foresees geometric and semantic representations. This decoupled future blueprint enables the action expert to anchor a coherent attention trajectory that perfectly aligns with the language instruction, thereby ensuring successful execution. Note: Geometric foresight is visualized as probe-based depth maps [li2025spatial, wu2026geometry]; one-step and semantic features are visualized via PCA computed globally across the entire sequence.
Figure 4: Qualitative comparison on MetaWorld [yu2020meta]. VPP [huvideo] utilizes entangled one-step features, resulting in a diverging attention trajectory that completely misses the target "nut". In contrast, our S-VAM foresees explicit geometric and semantic representations. This decoupled future blueprint enables the action expert to anchor a coherent attention trajectory that accurately leads to the instructed target, ensuring successful grasping. Note: Visualization settings follow Figure 3.
Figure 5: Multi-task real-world experiments. (a) We deploy our system on a dual-arm Cobot using only monocular front-camera observations. (b) Our S-VAM demonstrates a significant success rate improvement over VPP [huvideo] on all tasks without compromising real-time control capabilities.
