ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning

Yandan Yang; Shuang Zeng; Tong Lin; Xinyuan Chang; Dekang Qi; Junjin Xiao; Haoyun Liu; Ronghan Chen; Yuzhi Chen; Dongjie Huo; Feng Xiong; Xing Wei; Zhiheng Ma; Mu Xu

TL;DR

ABot-M0 tackles fragmentation in robotic perception and control by constructing UniACT-dataset from six open VLA sources and introducing Action Manifold Learning (AML) with a Diffusion Transformer to directly predict action sequences in the end-effector frame. A two-stage training regime (large-scale pre-training followed by space-aware fine-tuning) and a plug-in 3D perception module enable cross-embodiment generalization without proprietary data. Empirical results across LIBERO, LIBERO-Plus, RoboCasa, and RoboTwin show state-of-the-art performance and strong generalization, validating the Action Manifold Hypothesis and the benefits of modular perception. The work emphasizes reproducibility and openness, providing open-source pipelines to accelerate community-driven progress toward general-purpose embodied intelligence. Overall, ABot-M0 demonstrates that careful data curation, a manifold-based action predictor, and 3D-aware perception can yield robust cross-embodiment robotics without bespoke hardware or private datasets.

Abstract

Building general-purpose embodied agents across diverse hardware remains a central challenge in robotics, often framed as the "one-brain, many-forms" paradigm. Progress is hindered by fragmented data, inconsistent representations, and misaligned training objectives. We present ABot-M0, a framework that builds a systematic data curation pipeline while jointly optimizing model architecture and training strategies, enabling end-to-end transformation of heterogeneous raw data into unified, efficient representations. From six public datasets, we clean, standardize, and balance samples to construct UniACT-dataset, a large-scale dataset with over 6 million trajectories and 9,500 hours of data, covering diverse robot morphologies and task scenarios. Unified pre-training improves knowledge transfer and generalization across platforms and tasks, supporting general-purpose embodied intelligence. To improve action prediction efficiency and stability, we propose the Action Manifold Hypothesis: effective robot actions lie not in the full high-dimensional space but on a low-dimensional, smooth manifold governed by physical laws and task constraints. Based on this, we introduce Action Manifold Learning (AML), which uses a DiT backbone to predict clean, continuous action sequences directly. This shifts learning from denoising to projection onto feasible manifolds, improving decoding speed and policy stability. ABot-M0 supports modular perception via a dual-stream mechanism that integrates VLM semantics with geometric priors and multi-view inputs from plug-and-play 3D modules such as VGGT and Qwen-Image-Edit, enhancing spatial understanding without modifying the backbone and mitigating standard VLM limitations in 3D reasoning. Experiments show components operate independently with additive benefits. We will release all code and pipelines for reproducibility and future research.
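The contrast between predicting noise or velocity and predicting the clean action can be written out as a short worked formulation. This is only a sketch under assumptions: the linear noise-to-action interpolation, the symbols (A for a clean action chunk, c for the VLM conditioning, f_θ for the DiT action expert), and the squared-error loss are generic flow-matching conventions chosen for illustration, not the paper's own equations, which are not reproduced in this summary.

```latex
\begin{align}
% Assumed linear path from Gaussian noise to the clean action chunk A
A_t &= (1 - t)\,\epsilon + t\,A,
  \qquad \epsilon \sim \mathcal{N}(0, I),\quad t \in [0, 1) \\
% Conventional velocity target along that path (depends on the noise sample)
v &= A - \epsilon \\
% Action prediction: the network regresses the clean action itself
\mathcal{L}_{\text{action}}(\theta) &=
  \mathbb{E}_{A,\,\epsilon,\,t}\,
  \bigl\| f_\theta(A_t, t, c) - A \bigr\|_2^2 \\
% A velocity can still be recovered from the predicted action when a sampler needs one
\hat{v} &= \frac{f_\theta(A_t, t, c) - A_t}{1 - t}
\end{align}
```

The last line follows from the first two, since A − A_t = (1 − t)(A − ε). Under these assumptions the regression target A is the structured action sequence itself, whereas v and ε both carry the full-dimensional noise sample.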

Paper Structure (35 sections, 4 equations, 11 figures, 10 tables)

Introduction
Dataset
Analysis of Open-Source Datasets
Data Cleaning and Preprocessing
Standardization of Data Formats
The ABot-M0 Model
Model Architecture
Visual Language Model Backbone
Action Manifold Learning
Inference Process
Two-Stage Training Paradigm
Stage 1: Large-Scale Pre-training for Generalizable Action Priors
Stage 2: Space-Aware Supervised Fine-Tuning via Knowledge Injection
Pre-Training
Sampling Ratio for Multi-Embodiment Learning
...and 20 more sections

Figures (11)

Figure 1: Data cleaning and preprocessing pipeline to construct the UniACT-dataset.
Figure 2: Overview of the integrated UniACT-dataset, which contains more than six million trajectories spanning 9,500+ hours and 20+ unique robot embodiments.
Figure 3: Model architecture of ABot-M0. We employ a two-component architecture consisting of a VLM and an action expert. In addition, we utilize action manifold learning to predict actions with a two-stage training paradigm. We then carefully select VLM features and further introduce an optional 3D module to enhance spatial reasoning.
Figure 4: Action Manifold. (a) We posit that a meaningful action sequence is a highly structured entity residing on a low-dimensional action manifold. The conventional prediction targets of noise or velocity are inherently high-dimensional and off-manifold, which increases the burden of model learning and leads to unreasonable actions. (b) We propose to predict the action directly rather than velocity, which enables the model to focus on learning the intrinsic structure and semantics of actions.
Figure 5: Embodiment distribution under different sampling strategies. Data are drawn from OXE [o2024oxe], AgiBot-Beta [agibotbeta], and RoboCoin [wu2025robocoin]. The mixture recipe for single-arm data from OXE follows OpenVLA [kim2024openvla] and is fixed across the three strategies. We compare (a) Trajectory-Uniform, (b) Task-Uniform, and (c) Embodiment-Uniform sampling strategies and show the data distribution across datasets and embodiments.
...and 6 more figures
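
The three sampling strategies named in the Figure 5 caption can be illustrated with a toy calculation. This is a minimal sketch, assuming the strategies mean "every trajectory equally likely", "every task equally likely", and "every embodiment equally likely". The caption does not define them, so that reading, the `embodiment_mass` helper, and the toy inventory are all hypothetical and not taken from the paper.

```python
from collections import defaultdict

# Hypothetical toy inventory: (embodiment, task, number_of_trajectories).
INVENTORY = [
    ("arm_a", "pick", 600),
    ("arm_a", "place", 200),
    ("arm_a", "pour", 100),
    ("arm_b", "pick", 100),
]


def embodiment_mass(inventory, strategy):
    """Return the probability that a sampled trajectory comes from each embodiment.

    Assumed interpretation (not from the paper):
      "trajectory": every trajectory is equally likely.
      "task":       every (embodiment, task) pair is equally likely.
      "embodiment": every embodiment is equally likely.
    """
    # Group trajectory counts by embodiment, then by task.
    tasks = defaultdict(dict)
    for embodiment, task, count in inventory:
        tasks[embodiment][task] = tasks[embodiment].get(task, 0) + count

    if strategy == "trajectory":
        weight = {e: sum(t.values()) for e, t in tasks.items()}
    elif strategy == "task":
        weight = {e: len(t) for e, t in tasks.items()}
    elif strategy == "embodiment":
        weight = {e: 1 for e in tasks}
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    total = sum(weight.values())
    return {e: w / total for e, w in weight.items()}


if __name__ == "__main__":
    for name in ("trajectory", "task", "embodiment"):
        print(name, embodiment_mass(INVENTORY, name))
```

In this toy setting, the data-rich embodiment dominates under trajectory-uniform sampling, while the other two strategies shift probability toward the embodiment with fewer trajectories.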
