LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model

Zebin You, Xiaolu Zhang, Jun Zhou, Chongxuan Li, Ji-Rong Wen

TL;DR

Extensive experiments show that LLaDA-o achieves state-of-the-art performance among omni-diffusion models on multimodal understanding and generation benchmarks, and reaches 87.04 on DPG-Bench for text-to-image generation, supporting the effectiveness of unified omni diffusion modeling.

Abstract

We present LLaDA-o, an effective and length-adaptive omni diffusion model for multimodal understanding and generation. LLaDA-o is built on a Mixture of Diffusion (MoD) framework that decouples discrete masked diffusion for text understanding and continuous diffusion for visual generation, while coupling them through a shared, simple, and efficient attention backbone that reduces redundant computation for fixed conditions. Building on MoD, we further introduce a data-centric length adaptation strategy that enables flexible-length decoding in multimodal settings without architectural changes. Extensive experiments show that LLaDA-o achieves state-of-the-art performance among omni-diffusion models on multimodal understanding and generation benchmarks, and reaches 87.04 on DPG-Bench for text-to-image generation, supporting the effectiveness of unified omni diffusion modeling. Code is available at https://github.com/ML-GSAI/LLaDA-o.

Paper Structure

This paper contains 24 sections, 8 equations, 5 figures, 10 tables, 2 algorithms.

Figures (5)

  • Figure 1: Overview of LLaDA-o's capabilities. Top: multimodal understanding examples. Middle: text-to-image generation results following complex prompts (see Table \ref{tab:image_prompts} for the prompts). Bottom: case-by-case comparison with existing omni diffusion models, where LLaDA-o achieves stronger understanding performance and generates images with richer fine-grained details that follow the instructions.
  • Figure 2: Overview of LLaDA-o: the mixture of diffusion framework.
  • Figure 3: Implementation of intra-modality bidirectional attention. Yellow blocks indicate unmasked attention, while dashed white boxes denote masked attention. Text sequences are explicitly partitioned into Prompts (PRM) and Responses (RES) in cases (a–b).
  • Figure 4: Comparison of inference efficiency on MathVista. We visualize the throughput-accuracy trade-off by varying the confidence threshold for LLaDA-o and the refresh interval ($n$) of Fast-dLLM applied to LLaDA-V. Our approach outperforms LLaDA-V, achieving a $5.9\times$ speedup with comparable performance.
  • Figure 5: Additional generated samples. We present 12 randomly selected images generated by LLaDA-o. For each sample, the prompt used for generation is shown below the corresponding image. All results are produced under the same setting as in the main paper.
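
The confidence threshold mentioned in the Figure 4 caption can be illustrated with a minimal sketch of threshold-based parallel unmasking, a generic technique for masked diffusion decoding. This excerpt does not give LLaDA-o's actual decoding rule, so the function name `threshold_unmask_step`, the `MASK` placeholder, and the fallback to the single most confident position are all assumptions made for illustration.

```python
from typing import List, Optional

# Hypothetical placeholder for a position that is still masked.
MASK: Optional[int] = None


def threshold_unmask_step(
    tokens: List[Optional[int]],
    predictions: List[int],
    confidences: List[float],
    threshold: float,
) -> List[Optional[int]]:
    """Run one illustrative parallel decoding step.

    Every masked position whose confidence is at least `threshold` is
    committed to its predicted token. If no masked position qualifies,
    the single most confident one is committed so decoding always
    makes progress (an assumed fallback, not taken from the paper).
    """
    masked = [i for i, tok in enumerate(tokens) if tok is MASK]
    if not masked:
        return list(tokens)

    chosen = [i for i in masked if confidences[i] >= threshold]
    if not chosen:
        chosen = [max(masked, key=lambda i: confidences[i])]

    out = list(tokens)
    for i in chosen:
        out[i] = predictions[i]
    return out


# Toy inputs: position 0 is already decoded, positions 1-3 are masked.
tokens = [5, MASK, MASK, MASK]
predictions = [5, 7, 8, 9]
confidences = [1.0, 0.95, 0.40, 0.92]

loose = threshold_unmask_step(tokens, predictions, confidences, 0.9)
strict = threshold_unmask_step(tokens, predictions, confidences, 0.99)
```

In this toy setup a lower threshold commits more tokens per step (fewer steps, higher throughput), while a higher threshold commits fewer and relies on the fallback. That is one plausible reading of the throughput-accuracy trade-off the caption describes.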