Igniting VLMs toward the Embodied Space

Andy Zhai, Brae Liu, Bruno Fang, Chalse Cai, Ellie Ma, Ethan Yin, Hao Wang, Hugo Zhou, James Wang, Lights Shi, Lucy Liang, Make Wang, Qian Wang, Roy Gan, Ryan Yu, Shalfun Li, Starrick Liu, Sylas Chen, Vincent Chen, Zach Xu

TL;DR

The paper tackles the challenge of extending vision-language foundation models into embodied space by introducing WALL-OSS, an end-to-end embodied foundation model built on a tightly coupled Mixture-of-Experts architecture. It uses a two-stage training curriculum (Inspiration and Integration) and a Uni-CoT framework to fuse instruction reasoning, subtask decomposition, and high-frequency action synthesis within a single differentiable model. Through a large, multimodal, embodiment-centric dataset and multimodal co-training, WALL-OSS achieves state-of-the-art performance on long-horizon manipulation, robust instruction following, and embodied reasoning, while preserving core VL priors. The work points to a scalable path from VLMs to embodied AI, supported by open-source code and model checkpoints, with future directions toward end-to-end and intermediate modalities for AGI.

Abstract

While foundation models show remarkable progress in language and vision, existing vision-language models (VLMs) still have limited spatial and embodiment understanding. Transferring VLMs to embodied domains reveals fundamental mismatches between modalities, pretraining distributions, and training objectives, leaving action comprehension and generation as a central bottleneck on the path to AGI. We introduce WALL-OSS, an end-to-end embodied foundation model that leverages large-scale multimodal pretraining to achieve (1) embodiment-aware vision-language understanding, (2) strong language-action association, and (3) robust manipulation capability. Our approach employs a tightly coupled architecture and a multi-strategy training curriculum that enables Unified Cross-Level CoT, seamlessly unifying instruction reasoning, subgoal decomposition, and fine-grained action synthesis within a single differentiable framework. Our results show that WALL-OSS attains high success on complex long-horizon manipulations, demonstrates strong instruction following as well as complex understanding and reasoning, and outperforms strong baselines, thereby providing a reliable and scalable path from VLMs to embodied foundation models.

Paper Structure

This paper contains 23 sections, 3 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Current VLMs lack a sufficient understanding of space and action within embodied AI. This deficiency stems from a mismatch between the capabilities of existing pre-trained VLMs and the specific knowledge required for embodied tasks. WALL-OSS unleashes the embodied potential of VLMs, leading to enhanced embodied understanding and the ability to generate complex actions.
  • Figure 2: Different paradigms for transferring VLMs to action modeling. The blue parts denote the initialized weights inherited from the pretrained VLM backbone. DAM and CAM refer to Discrete Action Modeling and Continuous Action Modeling, respectively, while VL denotes Vision-Language, and SA denotes Self-Attention.
  • Figure 3: Architecture of WALL-OSS.
  • Figure 4: Overview of training and inference pipeline.
  • Figure 5: Overview of the multisource dataset. Left: composition across three sources (self-collected actions, open-source actions, and multimodal VQA). Middle (top to bottom): example images from self-collected actions, open-source actions, and multimodal VQA. Top-right: our representative robot hardware.
  • ...and 2 more figures