LLaDA-VLA: Vision Language Diffusion Action Models

Yuqing Wen, Hebei Li, Kefan Gu, Yucheng Zhao, Tiancai Wang, Xiaoyan Sun

TL;DR

This work tackles the challenge of applying diffusion-based vision-language models to robotic manipulation by introducing LLaDA-VLA, the first Vision-Language-Diffusion-Action model built on pretrained diffusion-based vision-language models (d-VLMs). It presents two key designs, localized special-token classification and hierarchical action-structured decoding, to bridge the domain gap and enforce the structured dependencies of action sequences. In experiments on SimplerEnv, CALVIN, and a real WidowX robot, LLaDA-VLA achieves state-of-the-art performance and strong generalization to unseen tasks. The results validate diffusion-based VLMs as a viable foundation for robotic policy learning and provide practical guidance for adapting such models to action generation in robotics.

Abstract

The rapid progress of autoregressive vision-language models (VLMs) has inspired growing interest in vision-language-action models (VLAs) for robotic manipulation. Recently, masked diffusion models, a paradigm distinct from autoregressive models, have begun to demonstrate competitive performance in text generation and multimodal applications, leading to the development of a series of diffusion-based VLMs (d-VLMs). However, leveraging such models for robot policy learning remains largely unexplored. In this work, we present LLaDA-VLA, the first Vision-Language-Diffusion-Action model built upon pretrained d-VLMs for robotic manipulation. To effectively adapt d-VLMs to the robotic domain, we introduce two key designs: (1) a localized special-token classification strategy that replaces full-vocabulary classification with special action token classification, reducing adaptation difficulty; (2) a hierarchical action-structured decoding strategy that decodes action sequences hierarchically, taking into account the dependencies within and across actions. Extensive experiments demonstrate that LLaDA-VLA significantly outperforms state-of-the-art VLAs both in simulation and on real-world robots.
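
As an illustration of design (1), the sketch below restricts classification to a small set of special action tokens instead of the full vocabulary. It is a minimal pure-Python sketch, not the authors' implementation: the toy vocabulary size, the action-token ids, and the function names `localized_log_probs` and `localized_cross_entropy` are all assumptions made for this example.

```python
import math

def localized_log_probs(logits, action_token_ids):
    """Log-softmax computed only over the action-token entries of the logits."""
    selected = [logits[i] for i in action_token_ids]
    peak = max(selected)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in selected))
    return {tok: logits[tok] - log_z for tok in action_token_ids}

def localized_cross_entropy(logits, action_token_ids, target_id):
    """Cross-entropy against a target that must be one of the action tokens."""
    return -localized_log_probs(logits, action_token_ids)[target_id]

# Toy vocabulary of 8 tokens; ids 5..7 are hypothetical special action tokens.
logits = [4.0, 3.0, 2.5, 1.0, 0.5, 1.0, 2.0, 0.0]
action_ids = [5, 6, 7]

log_probs = localized_log_probs(logits, action_ids)
predicted = max(log_probs, key=log_probs.get)   # best action token
full_vocab_argmax = logits.index(max(logits))   # best token over everything
loss = localized_cross_entropy(logits, action_ids, target_id=6)
```

In this toy case the full-vocabulary argmax lands on an ordinary text token (id 0), while the localized prediction is always a valid action token. That is one way to read the abstract's claim that narrowing the classification space reduces adaptation difficulty.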

Paper Structure

This paper contains 20 sections, 8 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Comparison between Autoregressive-based VLA Model and LLaDA-VLA.
  • Figure 2: Overview of LLaDA-VLA. (a) Overall architecture. Visual features extracted by the vision encoder are projected into the text space and concatenated with text tokens. Together with masked tokens, they are fed into a large language diffusion model to generate action sequences via Localized Special-Token Classification and further refined with Hierarchical Action-Structured Decoding. (b) Hierarchical Action-Structured Decoding strategy. Starting from a fully masked action sequence (except vision and text prompts), the model iteratively predicts masked tokens, performing action-level and token-level remasking based on confidence until the full sequence is decoded.
  • Figure 3: Qualitative results of LLaDA-VLA on CALVIN tasks.
  • Figure 4: Qualitative results of LLaDA-VLA on SimplerEnv tasks.
  • Figure 5: Qualitative results of LLaDA-VLA on real-world in-domain tasks.
  • ...and 1 more figure
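
The decoding loop described in the Figure 2 caption (start fully masked, predict, then remask at the action level and the token level by confidence) can be sketched roughly as follows. This is a hedged illustration only. The excerpt does not state the exact ranking rules, so scoring an action by the mean confidence of its masked tokens, committing `keep_per_step` tokens per iteration, and the stand-in predictor `toy_predict` are all assumptions, and every name here is hypothetical.

```python
MASK = None  # placeholder for a masked token slot

def hierarchical_decode(predict, num_actions, tokens_per_action, keep_per_step=2):
    """Iteratively fill a masked (num_actions x tokens_per_action) sequence."""
    seq = [[MASK] * tokens_per_action for _ in range(num_actions)]
    steps = 0
    while any(tok is MASK for action in seq for tok in action):
        preds = predict(seq)  # preds[a][t] == (token, confidence)

        def masked_slots(a):
            return [t for t in range(tokens_per_action) if seq[a][t] is MASK]

        def action_score(a):
            # Assumed rule: mean confidence over the action's still-masked slots.
            slots = masked_slots(a)
            return sum(preds[a][t][1] for t in slots) / len(slots)

        # Action level: keep only the most confident unfinished action.
        unfinished = [a for a in range(num_actions) if masked_slots(a)]
        best = max(unfinished, key=action_score)

        # Token level: commit its most confident tokens; everything else
        # (other actions, remaining slots) stays masked for the next pass.
        ranked = sorted(masked_slots(best), key=lambda t: preds[best][t][1], reverse=True)
        for t in ranked[:keep_per_step]:
            seq[best][t] = preds[best][t][0]
        steps += 1
    return seq, steps

def toy_predict(seq):
    """Deterministic stand-in for the diffusion model (not a real model)."""
    return [[(10 * a + t, 1.0 / (1 + a + t)) for t in range(len(action))]
            for a, action in enumerate(seq)]

decoded, steps = hierarchical_decode(toy_predict, num_actions=2, tokens_per_action=3)
```

Because each pass commits at least one token, the loop terminates after at most `num_actions * tokens_per_action` iterations. A real model would re-predict the masked slots conditioned on the tokens already committed, which is where the dependencies within and across actions would come into play.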