Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight

Yi Yang, Xueqi Li, Yiyang Chen, Jin Song, Yihan Wang, Zipeng Xiao, Jiadi Su, You Qiaoben, Pengfei Liu, Zhijie Deng

TL;DR

Problem: Vision-Language-Action (VLA) models struggle when learning from high-dimensional visual states while maintaining robust language grounding. Approach: Mantis introduces Disentangled Visual Foresight (DVF) with a diffusion Transformer head and latent-action queries to decouple visual foresight from action learning, aided by a residual visual pathway to preserve current state for grounding; training proceeds in three stages to balance vision, action, and language signals. Contributions: DVF provides concise, predictive cues that enhance action learning; Adaptive Temporal Ensemble (ATE) improves inference efficiency; large-scale pretraining on SSV2, DROID, and image-text data yields strong LIBERO results (96.7%) and superior real-world instruction-following and generalization. Impact: enables robust, language-guided robotic manipulation with faster convergence and practical, open-source deployment potential.

Abstract

Recent advances in Vision-Language-Action (VLA) models demonstrate that visual signals can effectively complement sparse action supervision. However, letting a VLA model directly predict high-dimensional visual states can dilute model capacity and incur prohibitive training cost, while compressing visual states into more compact supervisory signals inevitably creates information bottlenecks. Moreover, existing methods often suffer from poor comprehension and reasoning capabilities because they neglect language supervision. This paper introduces Mantis, a novel framework featuring Disentangled Visual Foresight (DVF) to tackle these issues. Specifically, Mantis decouples visual foresight prediction from the backbone by combining meta queries with a diffusion Transformer (DiT) head. With the current visual state provided to the DiT via a residual connection, a simple next-state prediction objective enables the meta queries to automatically capture the latent actions that delineate the visual trajectory, and hence boosts the learning of explicit actions. The disentanglement reduces the burden on the VLA backbone, enabling it to maintain comprehension and reasoning capabilities through language supervision. Empirically, pretrained on human manipulation videos, robot demonstrations, and image-text pairs, Mantis achieves a 96.7% success rate on the LIBERO benchmark after fine-tuning, surpassing powerful baselines while converging quickly. Real-world evaluations show that Mantis outperforms $\pi_{0.5}$, a leading open-source VLA model, particularly in instruction-following capability, generalization to unseen instructions, and reasoning ability. Code and weights are released to support the open-source community.

Paper Structure

This paper contains 19 sections, 4 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Vision-augmented action learning paradigms. (a) Visual Foresight enhances action prediction by forecasting future frames. (b) Track Guidance employs compressed visual state representations to guide action prediction. (c) Latent Action Supervision improves action learning through auxiliary latent actions.
  • Figure 2: Left: Progressive training recipe. Mantis progressively integrates multiple modalities to achieve stable and well-balanced optimization. Center: Overview of Mantis. The framework consists of a backbone network, a DVF head, and an action head. The DVF head predicts future frames to facilitate latent action learning, thereby improving action prediction. Language supervision helps maintain the backbone’s capability for understanding and reasoning. Right: Adaptive Temporal Ensemble. Mantis-ATE dynamically adjusts the ensemble strength based on the overlap between target tokens and dynamic tokens.
  • Figure 3: Visualization of multi-gap future frame generation.
  • Figure 4: Visualization of ATE. The attention heatmap uses darker colors to represent higher values, whereas in the cosine similarity heatmap the opposite holds. The parameters are set as $\tau_\text{target} = 1$ and $\tau_\text{dynamic} = 12$.
  • Figure 5: Convergence speed comparison. Compared with traditional visual foresight methods such as UnifiedVLA [wang2025unified], Mantis converges significantly faster, underscoring the necessity of decoupling foresight prediction from action learning.
  • ...and 5 more figures
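
The Adaptive Temporal Ensemble summarized in the Figure 2 and Figure 4 captions can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that target tokens are the image patches with the highest attention scores, that dynamic tokens are the patches whose features changed most between consecutive frames (lowest cosine similarity), and that the ensemble strength is the fraction of target tokens that are also dynamic. All function names, the parameters `n_target` and `n_dynamic`, and the rule that turns strength into averaging weights are hypothetical; how they relate to $\tau_\text{target}$ and $\tau_\text{dynamic}$ is not specified in this excerpt.

```python
import math


def top_k_indices(scores, k, largest=True):
    """Return the set of indices holding the k largest (or smallest) scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=largest)
    return set(order[:k])


def cosine_similarity(a, b):
    """Cosine similarity of two feature vectors (1.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 1.0


def select_tokens(attention, prev_patches, curr_patches, n_target, n_dynamic):
    """Assumed selection rule: target = most attended patches,
    dynamic = patches least similar to the previous frame."""
    target = top_k_indices(attention, n_target, largest=True)
    sims = [cosine_similarity(p, c) for p, c in zip(prev_patches, curr_patches)]
    dynamic = top_k_indices(sims, n_dynamic, largest=False)
    return target, dynamic


def overlap_strength(target, dynamic):
    """Hypothetical ensemble strength: share of target tokens that are dynamic."""
    return len(target & dynamic) / len(target) if target else 0.0


def ensemble_action(predictions, strength):
    """Blend cached predictions for the current step (oldest first).

    Weights are strength ** age, with age 0 for the newest prediction:
    strength 0 keeps only the newest prediction (no ensembling),
    strength 1 averages all predictions uniformly.
    """
    n = len(predictions)
    weights = [strength ** (n - 1 - i) for i in range(n)]
    total = sum(weights)
    dim = len(predictions[0])
    return [
        sum(w * p[d] for w, p in zip(weights, predictions)) / total
        for d in range(dim)
    ]
```

In this sketch a control loop would call `select_tokens` once per frame, pass the two sets to `overlap_strength`, and then feed the resulting strength to `ensemble_action` together with the cached action predictions for the current time step. When the attended region is static the strength is 0 and the newest prediction is used unchanged; when it is moving, older predictions contribute more.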