Dexbotic: Open-Source Vision-Language-Action Toolbox

Bin Xie; Erjin Zhou; Fan Jia; Hao Shi; Haoqiang Fan; Haowei Zhang; Hebei Li; Jianjian Sun; Jie Bin; Junwen Huang; Kai Liu; Kaixin Liu; Kefan Gu; Lin Sun; Meng Zhang; Peilong Han; Ruitao Hao; Ruitao Zhang; Saike Huang; Songhan Xie; Tiancai Wang; Tianle Liu; Wenbin Tang; Wenqi Zhu; Yang Chen; Yingfei Liu; Yizhuang Zhou; Yu Liu; Yucheng Zhao; Yunchao Ma; Yunfei Wei; Yuxiang Chen; Ze Chen; Zeming Li; Zhao Wu; Ziheng Zhang; Ziming Liu; Ziwei Yan; Ziyu Zhang

TL;DR

Dexbotic addresses fragmentation in Vision-Language-Action (VLA) research with an open-source, PyTorch-based toolbox that unifies diverse VLA policies under a single framework, enabling fair comparison and scalable experimentation. It introduces the DexboticVLM foundation model and the Dexdata data format, and its layered components and action experts support both discrete and continuous action representations; discrete actions are quantized into $256$ bins. An experiment-centric workflow built on base_exp and Exp scripts speeds up development, while pretrained models such as Dexbotic-Base and Dexbotic-CogACT improve the performance of policies including $\pi_0$ and CogACT on multiple simulators. A Real2Sim protocol (DOS-Twins), extensive benchmarks, and real-world demonstrations support sim-to-real transfer and practical deployment.
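To make the 256-bin discretization concrete, here is a minimal sketch of one common way to turn continuous actions into discrete tokens and back. It assumes uniform bins over a normalized range of [-1, 1]; the function names, the range, and the uniform spacing are illustrative assumptions, not Dexbotic's actual implementation, which may normalize or bin actions differently.

```python
NUM_BINS = 256  # bin count stated in the summary; everything else here is assumed


def discretize(action, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map each continuous value to an integer bin index in [0, num_bins - 1]."""
    tokens = []
    for a in action:
        a = min(max(a, low), high)             # clip to the assumed range
        frac = (a - low) / (high - low)        # position within the range, in [0, 1]
        idx = min(int(frac * num_bins), num_bins - 1)  # top edge falls in last bin
        tokens.append(idx)
    return tokens


def undiscretize(tokens, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map bin indices back to the center value of each bin."""
    width = (high - low) / num_bins
    return [low + (t + 0.5) * width for t in tokens]


if __name__ == "__main__":
    action = [-1.0, 0.0, 0.3, 1.0]
    tokens = discretize(action)
    print(tokens)
    print(undiscretize(tokens))
```

Under these assumptions the round-trip error of any in-range value is at most half a bin width, i.e. 1/256 of the unit used here. That quantization error is the price of a discrete representation, and a continuous action expert avoids it.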

Abstract

In this paper, we present Dexbotic, an open-source Vision-Language-Action (VLA) model toolbox based on PyTorch. It aims to provide a one-stop VLA research service for professionals in the field of embodied intelligence. Its codebase supports multiple mainstream VLA policies simultaneously, allowing users to reproduce various VLA methods with a single environment setup. The toolbox is experiment-centric: users can quickly develop new VLA experiments by simply modifying the Exp script. Moreover, we provide stronger pretrained models that yield substantial performance improvements for state-of-the-art VLA policies. Dexbotic will be continuously updated to include more of the latest pretrained foundation models and cutting-edge VLA models in the industry.
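The experiment-centric idea, where a new experiment is defined by changing only what differs from a base experiment, can be sketched with plain Python dataclasses. Every name below (`BaseExp`, `OptimizerConfig`, `MyExp`, the fields, and the hyperparameter values) is a hypothetical placeholder chosen for illustration; this is not Dexbotic's actual base_exp or Exp script API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class OptimizerConfig:
    # Placeholder hyperparameters, not values taken from Dexbotic.
    lr: float = 2e-5
    weight_decay: float = 0.0


@dataclass(frozen=True)
class BaseExp:
    """Hypothetical base experiment holding shared defaults."""
    name: str = "base"
    policy: str = "cogact"
    action_bins: int = 256
    optimizer: OptimizerConfig = field(default_factory=OptimizerConfig)

    def describe(self) -> str:
        return f"{self.name}: policy={self.policy}, lr={self.optimizer.lr}"


@dataclass(frozen=True)
class MyExp(BaseExp):
    """A new experiment overrides only the fields that differ from the base."""
    name: str = "my_exp"
    optimizer: OptimizerConfig = field(
        default_factory=lambda: OptimizerConfig(lr=1e-4)
    )


if __name__ == "__main__":
    print(BaseExp().describe())
    print(MyExp().describe())
```

In this sketch the derived experiment inherits everything it does not override, so the difference between two experiments stays visible in a few lines.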

Figures (8)