Dexbotic: Open-Source Vision-Language-Action Toolbox

Bin Xie, Erjin Zhou, Fan Jia, Hao Shi, Haoqiang Fan, Haowei Zhang, Hebei Li, Jianjian Sun, Jie Bin, Junwen Huang, Kai Liu, Kaixin Liu, Kefan Gu, Lin Sun, Meng Zhang, Peilong Han, Ruitao Hao, Ruitao Zhang, Saike Huang, Songhan Xie, Tiancai Wang, Tianle Liu, Wenbin Tang, Wenqi Zhu, Yang Chen, Yingfei Liu, Yizhuang Zhou, Yu Liu, Yucheng Zhao, Yunchao Ma, Yunfei Wei, Yuxiang Chen, Ze Chen, Zeming Li, Zhao Wu, Ziheng Zhang, Ziming Liu, Ziwei Yan, Ziyu Zhang

TL;DR

Dexbotic addresses fragmentation in Vision-Language-Action (VLA) research by providing an open-source, PyTorch-based toolbox that unifies diverse VLA policies under a single framework, enabling fair comparisons and scalable experimentation. It introduces the DexboticVLM foundation model and the Dexdata data format, and supports both discrete and continuous action representations through layered components and action experts, with discrete actions quantized into $256$ bins. An experiment-centric workflow built on base_exp and Exp scripts accelerates development, while pretrained models such as Dexbotic-Base and Dexbotic-CogACT improve the performance of policies including $\pi_0$ and CogACT on multiple simulators. A Real2Sim protocol (DOS-Twins), extensive benchmarks, and real-world demonstrations support sim-to-real transfer and practical deployment.
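To make the 256-bin discretization concrete, here is a minimal sketch of uniform action binning. The summary only states the bin count; the normalized range of [-1, 1], the uniform bin width, and the function names `discretize` and `undiscretize` are assumptions for illustration and are not taken from the Dexbotic codebase.

```python
NUM_BINS = 256  # bin count stated in the summary above


def discretize(value, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map a continuous action value to a bin index in [0, num_bins - 1].

    Assumes uniform bins over [low, high]; values outside are clipped.
    """
    clipped = min(max(value, low), high)
    fraction = (clipped - low) / (high - low)
    # The upper edge (fraction == 1.0) would land in bin num_bins, so clamp it.
    return min(int(fraction * num_bins), num_bins - 1)


def undiscretize(index, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map a bin index back to the center of its bin."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width
```

Under these assumptions, a round trip through `discretize` and `undiscretize` changes an in-range value by at most half a bin width, which is 1/256 for the range [-1, 1].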

Abstract

In this paper, we present Dexbotic, an open-source Vision-Language-Action (VLA) model toolbox based on PyTorch. It aims to provide a one-stop VLA research service for professionals in the field of embodied intelligence. It offers a codebase that supports multiple mainstream VLA policies simultaneously, allowing users to reproduce various VLA methods with a single environment setup. The toolbox is experiment-centric: users can quickly develop new VLA experiments by simply modifying the Exp script. Moreover, we provide much stronger pretrained models that yield large performance improvements for state-of-the-art VLA policies. Dexbotic will be continuously updated to include more of the latest pretrained foundation models and cutting-edge VLA models in the industry.
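The experiment-centric idea, where a new experiment is derived from a base experiment by overriding a few settings, can be sketched with plain Python dataclasses. This is an illustration only: every class name, field name, and default value below is hypothetical and does not reflect Dexbotic's actual API. The five configuration groups (trainer, data, optimizer, model, inference) follow the Figure 4 caption.

```python
from dataclasses import dataclass, field


# All names and defaults here are hypothetical placeholders.
@dataclass
class TrainerConfig:
    num_epochs: int = 1
    batch_size: int = 8


@dataclass
class DataConfig:
    dataset_name: str = "placeholder_dataset"


@dataclass
class OptimizerConfig:
    learning_rate: float = 1e-4


@dataclass
class ModelConfig:
    checkpoint: str = "placeholder_checkpoint"


@dataclass
class InferenceConfig:
    port: int = 8000


@dataclass
class BaseExp:
    """Base experiment bundling one config object per group."""
    trainer: TrainerConfig = field(default_factory=TrainerConfig)
    data: DataConfig = field(default_factory=DataConfig)
    optimizer: OptimizerConfig = field(default_factory=OptimizerConfig)
    model: ModelConfig = field(default_factory=ModelConfig)
    inference: InferenceConfig = field(default_factory=InferenceConfig)


@dataclass
class MyExp(BaseExp):
    """A new experiment: override only the groups that differ."""
    data: DataConfig = field(
        default_factory=lambda: DataConfig(dataset_name="my_robot_demos"))
    optimizer: OptimizerConfig = field(
        default_factory=lambda: OptimizerConfig(learning_rate=5e-5))
```

In this sketch, `MyExp()` inherits the trainer, model, and inference settings from `BaseExp` and changes only the dataset and learning rate, which mirrors the "modify the Exp script" workflow described in the abstract.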

Paper Structure

This paper contains 25 sections, 8 figures, 5 tables.

Figures (8)

  • Figure 1: The overall architecture of Dexbotic. It introduces the Dexdata format to unify different embodiments. In the model layer, Dexbotic integrates the open-source vision encoder, LLM, and action expert through a unified modular VLA framework. Based on the provided DexboticVLMs, users can develop existing VLA policies and custom policies. On top of the developed policies, the experiment layer supports fast development. Both the training pipeline and the inference service are supported on some cloud services and customer GPUs.
  • Figure 2: The overall architecture of Dexbotic. The framework is organized into three core layers (data, model, and experiment) that work together to provide a complete solution for training and serving VLA models.
  • Figure 3: Overview of the Dexdata format.
  • Figure 4: The layered configuration architecture in the experiment layer. Each experiment class includes configurations for the trainer, data, optimizer, model, and inference.
  • Figure 5: The training pipeline of Dexbotic.
  • ...and 3 more figures