DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models

Lunbin Zeng, Jingfeng Yao, Bencheng Liao, Hongyuan Tao, Wenyu Liu, Xinggang Wang

TL;DR

The paper addresses the performance gap between diffusion-based vision-language models and autoregressive VLMs by proposing DiffusionVL, a framework that translates any powerful autoregressive model into a diffusion vision-language model through diffusion finetuning. It demonstrates two pathways: converting AR-VLMs directly and adapting AR-LMs via a vision-language connector followed by diffusion finetuning, augmented by a block-diffusion scheme that enables arbitrary-length generation and KV-cache reuse. Empirically, DiffusionVL achieves state-of-the-art results among diffusion VLMs using less than 5% of prior data, and attains up to 2x inference speedups, while continuing to close the gap with AR-VLMs. The work provides a data-efficient, architecture-agnostic route to high-performance multimodal models with practical inference advantages, and it releases code for broader adoption.

Abstract

In recent multimodal research, the diffusion paradigm has emerged as a promising alternative to the autoregressive (AR) paradigm, owing to its unique decoding advantages. However, due to the capability limitations of the base diffusion language model, the performance of diffusion vision language models (dVLMs) still lags significantly behind that of mainstream models. This leads to a simple yet fundamental question: is it possible to construct dVLMs based on existing powerful AR models? In response, we propose DiffusionVL, a dVLM family that can be translated from any powerful AR model. Through simple fine-tuning, we successfully adapt AR pre-trained models to the diffusion paradigm. This approach yields two key observations: (1) The paradigm shift from AR-based multimodal models to diffusion is remarkably effective. (2) Direct conversion of an AR language model to a dVLM is also feasible, achieving performance competitive with LLaVA-style visual instruction tuning. Further, we introduce a block-decoding design into dVLMs that supports arbitrary-length generation and KV cache reuse, achieving a significant inference speedup. We conduct extensive experiments. Despite training with less than 5% of the data required by prior methods, DiffusionVL achieves a comprehensive performance improvement (a 34.4% gain on the MMMU-Pro (vision) benchmark and a 37.5% gain on the MME (Cog.) benchmark) alongside a 2x inference speedup. The model and code are released at https://github.com/hustvl/DiffusionVL.

Paper Structure

This paper contains 17 sections, 6 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Performance Comparison. DiffusionVL achieves state-of-the-art (SOTA) performance among diffusion vision language models, including li2025lavida and you2025llada, and performance competitive with Qwen2.5-VL (qwen2025qwen25technicalreport).
  • Figure 2: Paradigm Shift and Modality Shift. We demonstrate that autoregressive models of different modalities can be effectively translated into diffusion vision language models.
  • Figure 3: The diffusion finetuning framework of our model. After the input is converted into the embedding space, block-wise noise is added to the answer text sequence within this space. The noise sequence $x_t^i$ is concatenated with the original sequence $x_0^i$ and fed into the language model. A noisy block can see information about the preceding blocks in the corresponding clean sequence (offset block causal) and other positions within the same block (block diagonal). During inference, the attention pattern of the clean sequence is used (block causal). The model performs denoising prediction and finally computes the loss at the masked noisy positions.
  • Figure 4: Balancing speed and quality for detailed image captioning. We define the parallelism factor for dVLMs as the average number of tokens generated simultaneously throughout the sequence (for instance, $1\times$ parallelism corresponds to single-token sampling). Speed metrics were collected using 8 GPUs, with results reported as the average per device.
  • Figure 5: Performance and speed under different thresholds for dynamic low-confidence remasking. By adjusting the threshold, DiffusionVL can achieve substantial acceleration relative to static low-confidence remasking, while also offering a tunable balance between speed and output quality (BERTScore).
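
The training-time attention pattern described in the Figure 3 caption can be sketched as a boolean mask. This is a minimal illustration, not the authors' implementation. It assumes the noisy sequence is placed first and the clean sequence second in the concatenation, that only the answer tokens are modeled (prompt and image tokens are omitted), and that the sequence length is a multiple of the block size. The function name `block_diffusion_training_mask` and its parameters are hypothetical.

```python
def block_diffusion_training_mask(seq_len, block_size):
    """Build a (2*seq_len) x (2*seq_len) boolean attention mask.

    Assumed layout (not confirmed by the source): positions [0, seq_len) hold
    the noisy sequence x_t, positions [seq_len, 2*seq_len) hold the clean
    sequence x_0. mask[q][k] is True when query q may attend to key k.
    """
    if block_size <= 0 or seq_len % block_size != 0:
        raise ValueError("seq_len must be a positive multiple of block_size")

    total = 2 * seq_len
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        q_noisy = q < seq_len
        q_block = (q % seq_len) // block_size
        for k in range(total):
            k_noisy = k < seq_len
            k_block = (k % seq_len) // block_size
            if q_noisy and k_noisy:
                # Block diagonal: a noisy token sees its own noisy block only.
                mask[q][k] = q_block == k_block
            elif q_noisy and not k_noisy:
                # Offset block causal: a noisy token sees strictly earlier clean blocks.
                mask[q][k] = k_block < q_block
            elif not q_noisy and not k_noisy:
                # Block causal: a clean token sees its own and earlier clean blocks.
                mask[q][k] = k_block <= q_block
            # Clean queries never attend to noisy keys (entry stays False).
    return mask


if __name__ == "__main__":
    # Two blocks of two tokens each; print the mask as rows of 0/1.
    for row in block_diffusion_training_mask(seq_len=4, block_size=2):
        print("".join("1" if allowed else "0" for allowed in row))
```

At inference time, per the caption, only the clean-to-clean (block causal) quadrant applies. Under that pattern the keys and values of finished blocks do not change as later blocks are decoded, which is consistent with the KV cache reuse mentioned in the abstract.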