dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought

Junjie Wen, Minjie Zhu, Jiaming Liu, Zhiyuan Liu, Yicun Yang, Linfeng Zhang, Shanghang Zhang, Yichen Zhu, Yi Xu

TL;DR

dVLA tackles open-world robotic instruction following by unifying visual perception, textual reasoning, and action execution within a single diffusion-based model. It introduces a multimodal chain-of-thought training paradigm and modality-specific tokenizers, coupled with a shared discrete diffusion objective that reconstructs masked tokens across vision, language, and actions. Empirically, it achieves state-of-the-art performance on the LIBERO benchmark (average success rate of $96.4\%$) and robust real-world results on a Franka arm, while two inference accelerations (a prefix attention mask and KV caching) provide up to roughly 2x speedup with minimal accuracy loss. The work demonstrates the practical viability and interpretability benefits of unified diffusion models for high-performance VLA robotics and motivates further integration of multimodal reasoning and control.
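As a rough illustration of what "reconstructing masked tokens" over a mixed sequence can look like, the sketch below corrupts a token sequence by independent random masking and records which positions a model would be asked to recover. It is a minimal sketch under stated assumptions, not the authors' implementation: the `[MASK]` symbol, the token strings, the fixed `mask_ratio`, and the helper name `mask_tokens` are all hypothetical.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def mask_tokens(tokens, mask_ratio, rng):
    """Mask each token independently with probability mask_ratio.

    Returns the corrupted sequence and a dict mapping each masked
    position to the original token (the reconstruction targets).
    """
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_ratio:
            corrupted.append(MASK)
            targets[i] = tok
        else:
            corrupted.append(tok)
    return corrupted, targets


# Hypothetical mixed sequence: image tokens, instruction text, action tokens.
sequence = ["<img_17>", "<img_4>", "pick", "up", "the", "cup", "<act_9>", "<act_2>"]
corrupted, targets = mask_tokens(sequence, mask_ratio=0.5, rng=random.Random(0))
```

In a training loop, a loss such as cross-entropy would typically be computed only at the positions listed in `targets`, regardless of which modality each token belongs to.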

Abstract

Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics. We introduce dVLA, a diffusion-based VLA that leverages a multimodal chain-of-thought to unify visual perception, language reasoning, and robotic control in a single system. dVLA jointly optimizes perception, language understanding, and action under a single diffusion objective, enabling stronger cross-modal reasoning and better generalization to novel instructions and objects. For practical deployment, we mitigate inference latency by incorporating two acceleration strategies, a prefix attention mask and KV caching, yielding up to around 2× speedup at test-time inference. We evaluate dVLA in both simulation and the real world: on the LIBERO benchmark, it achieves state-of-the-art performance with a 96.4% average success rate, consistently surpassing both discrete and continuous action policies; on a real Franka robot, it succeeds across a diverse task suite, including a challenging bin-picking task that requires multi-step planning, demonstrating robust real-world performance. Together, these results underscore the promise of unified diffusion frameworks for practical, high-performance VLA robotics.
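The abstract does not spell out the prefix attention mask, so the following is only one plausible reading, sketched under assumptions: the conditioning prefix (for example, observation and instruction tokens) attends only to itself, while the tokens being denoised attend to the whole sequence. Under that layout the prefix's keys and values do not depend on the denoised tokens, which is what would let them be cached across denoising steps. The function name and the `prefix_len`/`total_len` parameters are hypothetical.

```python
def prefix_attention_mask(prefix_len, total_len):
    """Build a boolean mask where mask[q][k] is True if query q may attend to key k.

    Assumed layout: positions [0, prefix_len) are the fixed prefix,
    positions [prefix_len, total_len) are the tokens being denoised.
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must lie in [0, total_len]")
    mask = []
    for q in range(total_len):
        if q < prefix_len:
            # Prefix queries see only prefix keys, so their activations
            # are unaffected by changes in the denoised tokens.
            row = [k < prefix_len for k in range(total_len)]
        else:
            # Denoised-token queries attend bidirectionally to everything.
            row = [True] * total_len
        mask.append(row)
    return mask


mask = prefix_attention_mask(prefix_len=3, total_len=5)
```

With `prefix_len=3` and `total_len=5`, the first three rows allow attention to columns 0 to 2 only, and the last two rows allow attention to all five columns.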

Paper Structure

This paper contains 12 sections, 1 equation, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The architecture of dVLA. We adopt a discrete diffusion language model as a backbone and separate tokenizers for each modality.
  • Figure 2: Examples of multimodal Chain-of-Thought on real robot tasks.
  • Figure 3: The experiment setup and real-world task suite.
  • Figure 4: Qualitative results on LIBERO simulation. Top: successful executions. Bottom: failed executions and the corresponding visual CoT.