Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

Lijiang Li, Zuwei Long, Yunhang Shen, Heting Gao, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He, Chaoyou Fu

TL;DR

This paper introduces Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion, unifying understanding and generation across text, speech, and images.

Abstract

While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture as their backbone, leaving significant room to explore effective and efficient alternatives in architectural design. Concurrently, recent studies have successfully applied discrete diffusion models to various domains, such as visual understanding and image generation, revealing their considerable potential as a backbone for multimodal systems. Drawing inspiration from this pioneering research, we introduce Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion models, which unifies understanding and generation across text, speech, and images. Omni-Diffusion employs a unified mask-based discrete diffusion model to directly capture the joint distribution over discrete multimodal tokens. This approach supports not only bimodal tasks but also more complex scenarios involving multiple modalities. On a diverse set of benchmarks, our method outperforms or performs on par with existing multimodal systems that process two or more modalities, highlighting the significant promise of diffusion models in powering the next generation of multimodal foundation models. Project webpage: https://omni-diffusion.github.io.
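
To make "mask-based discrete diffusion over multimodal tokens" concrete, the sketch below shows the forward (corruption) step as it is commonly formulated for masked diffusion: each token in a single mixed-modality sequence is replaced by a mask symbol independently with probability t, and the model would then be trained to predict the original tokens at the masked positions. This is an illustration under that common formulation, not the paper's exact equations. The `[MASK]` symbol, the `mask_tokens` helper, and the token names are all hypothetical.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def mask_tokens(tokens, t, rng):
    """Replace each token with MASK independently with probability t.

    Returns the corrupted sequence and the indices that were masked;
    in masked diffusion training, only those indices contribute to the loss.
    """
    noisy = [MASK if rng.random() < t else tok for tok in tokens]
    masked_positions = [i for i, tok in enumerate(noisy) if tok == MASK]
    return noisy, masked_positions


# One toy sequence mixing text, image, and speech tokens (made-up IDs).
sequence = ["txt:a", "txt:cat", "img:17", "img:902", "img:44", "spk:8", "spk:311"]

rng = random.Random(0)
noisy, masked_positions = mask_tokens(sequence, t=0.5, rng=rng)
print(noisy)
print(masked_positions)
```

Because the corruption treats every position the same way regardless of modality, a single denoiser trained on such sequences sees text, image, and speech tokens under one objective, which is the sense in which the distribution is modeled jointly.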

Paper Structure (29 sections, 2 equations, 9 figures, 4 tables)

This paper contains 29 sections, 2 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Overview of Omni-Diffusion. Our model takes multimodal tokens as input, processes them using a unified mask-based discrete diffusion model, and generates output data of the desired modality. By modeling the joint distribution over discrete multimodal tokens, Omni-Diffusion can handle not only bimodal tasks (e.g., automatic speech recognition, text-to-speech, visual QA, and text-to-image) but also tasks requiring the integration of more than two modalities, such as speech-to-image generation and spoken visual understanding.
  • Figure 2: Architecture overview. Omni-Diffusion is an any-to-any multimodal system built on a mask-token-based discrete diffusion model. By modeling a unified distribution over multimodal discrete tokens through mask-token prediction, Omni-Diffusion can perform comprehension and generation across modalities, including text, image, and speech.
  • Figure 3: Training pipeline of Omni-Diffusion. The first stage pre-aligns the textual capability of the pre-trained diffusion language model with the visual modality. The second stage further enhances the multimodal capability of the diffusion model by jointly training on speech and visual data. The last stage optimizes the model on our constructed SDVI datasets, which consist of speech-to-image and image-to-speech tasks, further enhancing the model's unified multimodal alignment across modalities.
  • Figure 4: Generated samples of Omni-Diffusion on spoken interaction with visual content.
  • Figure 5: Generated samples of Omni-Diffusion on text-to-image and speech-to-image tasks.
  • ...and 4 more figures
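
The any-to-any behavior described in the Figure 1 and Figure 2 captions can be pictured as iterative unmasking: the conditioning tokens (for example, speech) stay fixed, the target slots (for example, image tokens) start fully masked, and each step commits the predictions the model is most confident about. The sketch below assumes a confidence-based unmasking schedule, which is a common choice for masked generative models and is not confirmed as the paper's sampler. `toy_predictor` is a hypothetical stand-in for the real network, and all token names are made up.

```python
MASK = None  # hypothetical placeholder for a masked position


def toy_predictor(seq):
    """Stand-in for the diffusion model.

    For every masked index i, returns a (token, confidence) pair.
    The toy rule predicts 'img:<i>' and is more confident at earlier positions.
    """
    return {i: (f"img:{i}", 1.0 / (1 + i)) for i, tok in enumerate(seq) if tok is MASK}


def iterative_unmask(prompt, target_len, steps, predictor):
    """Fill target_len masked slots after the prompt in at most `steps` rounds."""
    seq = list(prompt) + [MASK] * target_len
    per_step = -(-target_len // steps)  # ceiling division
    for _ in range(steps):
        preds = predictor(seq)
        if not preds:
            break  # nothing left to unmask
        # Commit the most confident predictions; the rest stay masked.
        best = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:per_step]
        for i in best:
            seq[i] = preds[i][0]
    return seq


speech_prompt = ["spk:8", "spk:311", "spk:52"]
out = iterative_unmask(speech_prompt, target_len=4, steps=2, predictor=toy_predictor)
print(out)
```

Swapping which modality sits in the prompt and which sits in the masked slots is all that changes between tasks such as speech-to-image and image-to-speech in this picture; the loop itself stays the same.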