Diffusion Models For Multi-Modal Generative Modeling

Changyou Chen, Han Ding, Bunyamin Sisman, Yi Xu, Ouye Xie, Benjamin Z. Yao, Son Dinh Tran, Belinda Zeng

TL;DR

This work extends diffusion models to multi-modal generation by embedding heterogeneous modality data into a shared diffusion space Z via encoders and training with a multi-task ELBO that integrates forward aggregation and modality-specific decoders. The MT-Diffusion framework derives joint forward and reverse processes, enabling simultaneous generation across modalities and conditional generation when some modalities are known. Empirical results on ImageNet across tasks such as masked-image training and joint image-label generation show improved training efficiency and competitive performance with flexibility to handle multiple modalities within a single model. This approach suggests a promising direction for unified, multi-modal generative modeling with potential benefits in data efficiency and cross-task transfer learning.

Abstract

Diffusion-based generative modeling has achieved state-of-the-art results on various generation tasks. Most diffusion models, however, are limited to modeling a single generation task. Can we generalize diffusion models to multi-modal generative training for more generalizable modeling? In this paper, we propose a principled way to define a diffusion model by constructing a unified multi-modal diffusion model in a common diffusion space. We define the forward diffusion process to be driven by information aggregation from multiple types of task data, e.g., images for a generation task and labels for a classification task. In the reverse process, we enforce information sharing by parameterizing a shared backbone denoising network with additional modality-specific decoder heads. Such a structure can simultaneously learn to generate different types of multi-modal data with a multi-task loss, which is derived from a new multi-modal variational lower bound that generalizes the standard diffusion model. We propose several multi-modal generation settings to verify our framework, including image transition, masked-image training, joint image-label generative modeling, and joint image-representation generative modeling. Extensive experimental results on ImageNet indicate the effectiveness of our framework for various multi-modal generative modeling tasks, which we believe is an important research direction worthy of further exploration.
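
The structure described in the abstract (modality-specific encoders into a shared space, a forward step that aggregates and noises, and a shared backbone with per-modality decoder heads) can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: the linear encoders and heads, the dimensions `D`, `IMG_DIM`, and `NUM_CLASSES`, the plain-average `aggregate` function, the single-step DDPM-style noising, and the unweighted sum of losses are all hypothetical stand-ins. The paper derives its own aggregation rule and loss weights from the multi-modal variational lower bound.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8             # shared diffusion-space dimension (hypothetical)
IMG_DIM = 16      # flattened toy "image" size (hypothetical)
NUM_CLASSES = 10  # toy label vocabulary (hypothetical)

# Hypothetical linear encoders, one per modality, mapping into the shared space.
W_img = rng.normal(scale=0.1, size=(IMG_DIM, D))
W_lbl = rng.normal(scale=0.1, size=(NUM_CLASSES, D))

def encode_image(x):
    return x @ W_img

def encode_label(y):
    return np.eye(NUM_CLASSES)[y] @ W_lbl  # one-hot, then project

def aggregate(encoded):
    # Assumption: a plain average stands in for the paper's forward aggregation.
    return np.mean(np.stack(encoded, axis=0), axis=0)

def forward_noise(z0, alpha_bar, rng):
    # Standard DDPM-style noising of the aggregated latent.
    eps = rng.normal(size=z0.shape)
    zt = np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps
    return zt, eps

# Shared backbone plus one head per output (all hypothetical linear maps).
W_backbone = rng.normal(scale=0.1, size=(D, D))
W_eps_head = rng.normal(scale=0.1, size=(D, D))
W_img_head = rng.normal(scale=0.1, size=(D, IMG_DIM))
W_lbl_head = rng.normal(scale=0.1, size=(D, NUM_CLASSES))

def denoise(zt):
    h = np.tanh(zt @ W_backbone)  # shared features used by every head
    return {
        "eps": h @ W_eps_head,
        "image": h @ W_img_head,
        "label_logits": h @ W_lbl_head,
    }

def multi_task_loss(x, y, alpha_bar, rng):
    z0 = aggregate([encode_image(x), encode_label(y)])
    zt, eps = forward_noise(z0, alpha_bar, rng)
    out = denoise(zt)
    loss_eps = np.mean((out["eps"] - eps) ** 2)   # noise-prediction term
    loss_img = np.mean((out["image"] - x) ** 2)   # image reconstruction term
    logits = out["label_logits"]
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.sum(np.exp(logits), axis=-1, keepdims=True))
    loss_lbl = -np.mean(log_probs[np.arange(len(y)), y])  # label cross-entropy
    # Assumption: unweighted sum; the paper's ELBO determines the actual weights.
    return loss_eps + loss_img + loss_lbl

x = rng.normal(size=(4, IMG_DIM))
y = np.array([0, 3, 7, 9])
loss = multi_task_loss(x, y, alpha_bar=0.5, rng=rng)
print(float(loss))
```

The point of the sketch is the information flow: every modality contributes to one latent, and every head reads the same backbone features, so gradients from the image and label terms both shape the shared parameters.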

Paper Structure

This paper contains 43 sections, 4 theorems, 20 equations, 14 figures, 4 tables, 2 algorithms.

Key Result

Theorem 1

The negative ELBO of MT-Diffusion takes the form $\mathcal{L} = \mathbb{E}_{q}\left[\mathcal{L}_0 + \mathcal{L}_1 + \mathcal{L}_2 + \mathcal{L}_3\right]$, where

Figures (14)

  • Figure 1: Illustration of the proposed MT-Diffusion on two modalities. The diffusion process is defined in a shared diffusion space for all modality data, which are transformed by the modality-specific encoders. The forward noising process includes a forward aggregation step that integrates information from multi-modal data, and the reverse denoising component transforms the diffusion space back to the task-specific data spaces with learnable decoders through a multi-task loss.
  • Figure 2: The forward (left) and reverse (right) processes of the proposed MT-Diffusion by jointly modeling a set of task data.
  • Figure 3: Training pipeline and encoder-decoder design choices. ①②③ indicate three possible choices for the encoder $E(\cdot)$; gray shaded boxes indicate stop gradients; and black dashed lines mean possible connections to the encoder and decoder. "Aggregate" is implemented through Equation \ref{eq:marginal}.
  • Figure 4: Random samples on the night2day (top, unconditional generation) and cityscape (bottom, conditional generation) datasets. The 3 pictures in each block of the cityscape dataset (bottom) correspond to the conditional image (source), the ground-truth and the inferred target image, respectively.
  • Figure 5: Randomly generated examples of MT-Diffusion with masked-image training. First row: image restoration from random masks; images in each block: original, masked and restored images. Second row: image restoration from half masking; each block contains two restored images to illustrate generation variance. Third row: image generation from scratch with complete masks.
  • ...and 9 more figures

Theorems & Definitions (6)

  • Theorem 1
  • Remark 1
  • Theorem 2
  • Remark 2
  • Lemma 3
  • Lemma 4: pml1Book