Diffusion Models For Multi-Modal Generative Modeling

Changyou Chen; Han Ding; Bunyamin Sisman; Yi Xu; Ouye Xie; Benjamin Z. Yao; Son Dinh Tran; Belinda Zeng

TL;DR

This work extends diffusion models to multi-modal generation by embedding heterogeneous modality data into a shared diffusion space Z via encoders and training with a multi-task ELBO that integrates forward aggregation and modality-specific decoders. The MT-Diffusion framework derives joint forward and reverse processes, enabling simultaneous generation across modalities and conditional generation when some modalities are known. Empirical results on ImageNet across tasks such as masked-image training and joint image-label generation show improved training efficiency and competitive performance with flexibility to handle multiple modalities within a single model. This approach suggests a promising direction for unified, multi-modal generative modeling with potential benefits in data efficiency and cross-task transfer learning.
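To make the idea of a shared diffusion space concrete, the following is a minimal sketch, not the paper's implementation. It assumes two already-encoded modalities, a hypothetical weighted-sum `aggregate` function, and the standard DDPM closed-form marginal with a linear noise schedule. For simplicity it aggregates once before noising; the forward aggregation derived in the paper may differ. All function names and the example embeddings are illustrative assumptions.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Standard DDPM linear noise schedule (an assumption, not from the paper)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas):
    """Cumulative products of (1 - beta_t), one value per timestep."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def aggregate(encoded, weights):
    """Hypothetical aggregation: weighted sum of per-modality embeddings in Z."""
    dim = len(encoded[0])
    return [sum(w * e[i] for w, e in zip(weights, encoded)) for i in range(dim)]


def forward_sample(z0, alpha_bar_t, rng):
    """Draw z_t ~ N(sqrt(alpha_bar_t) * z0, (1 - alpha_bar_t) * I)."""
    scale = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    noise = [rng.gauss(0.0, 1.0) for _ in z0]
    z_t = [scale * z + sigma * n for z, n in zip(z0, noise)]
    return z_t, noise


rng = random.Random(0)
image_embedding = [0.5, -1.0, 2.0, 0.0]  # stand-in for an image encoder output
label_embedding = [1.0, 1.0, 0.0, -2.0]  # stand-in for a label encoder output
z0 = aggregate([image_embedding, label_embedding], weights=[0.5, 0.5])
abar = alpha_bars(linear_beta_schedule(1000))
z_t, eps = forward_sample(z0, abar[499], rng)  # noised point at the 500th step
```

Because both modalities are mapped into the same space before noising, a single denoising network can operate on `z_t` regardless of which modality contributed to it.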

Abstract

Diffusion-based generative modeling has achieved state-of-the-art results on various generation tasks. Most diffusion models, however, are limited to modeling a single generation task. Can we generalize diffusion models to support multi-modal generative training, and thereby obtain more generalizable models? In this paper, we propose a principled way to define such a model by constructing a unified multi-modal diffusion model in a common diffusion space. We define the forward diffusion process to be driven by information aggregated from multiple types of task data, e.g., images for a generation task and labels for a classification task. In the reverse process, we enforce information sharing by parameterizing a shared backbone denoising network with additional modality-specific decoder heads. This structure can simultaneously learn to generate different types of multi-modal data with a multi-task loss, which is derived from a new multi-modal variational lower bound that generalizes the standard diffusion model. We propose several multi-modal generation settings to verify our framework, including image transition, masked-image training, joint image-label generative modeling, and joint image-representation generative modeling. Extensive experimental results on ImageNet indicate the effectiveness of our framework for various multi-modal generative modeling tasks, which we believe is an important research direction worthy of further exploration.
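The shared-backbone design with modality-specific heads can be illustrated with a toy example. This is a sketch under stated assumptions: the single linear-plus-tanh backbone, the two linear heads, the parameter dictionary, and the weighted sum of a noise-prediction MSE and a label cross-entropy are all hypothetical stand-ins. The paper derives its loss from a multi-modal variational lower bound, which this generic multi-task loss does not reproduce.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for one example."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def shared_backbone(z_t, weights):
    """Toy shared trunk: one linear layer followed by tanh."""
    return [math.tanh(h) for h in matvec(weights, z_t)]


def multi_task_loss(z_t, eps, label, params, task_weights=(1.0, 1.0)):
    """Weighted sum of per-modality losses computed from shared features."""
    features = shared_backbone(z_t, params["backbone"])
    eps_pred = matvec(params["noise_head"], features)  # image-side head
    logits = matvec(params["label_head"], features)    # label-side head
    losses = {
        "noise": mse(eps_pred, eps),
        "label": cross_entropy(logits, label),
    }
    total = task_weights[0] * losses["noise"] + task_weights[1] * losses["label"]
    return total, losses


# Hypothetical all-zero parameters: 2-d diffusion space, 2 features, 3 classes.
params = {
    "backbone": [[0.0, 0.0], [0.0, 0.0]],
    "noise_head": [[0.0, 0.0], [0.0, 0.0]],
    "label_head": [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
}
total, losses = multi_task_loss([1.0, 2.0], [1.0, -1.0], 0, params)
```

With all-zero parameters the noise head predicts zeros and the label head outputs uniform logits, so the two terms reduce to the mean squared noise and the log of the number of classes, which makes the example easy to check by hand.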

Paper Structure

This paper contains 43 sections, 4 theorems, 20 equations, 14 figures, 4 tables, 2 algorithms.

Introduction
Multi-Modal Diffusion Models
Preliminaries on Denoising Diffusion Probabilistic Models (DDPM)
Multi-Modal Diffusion Models
Forward-Reverse Processes and the Variational Lower Bound
Forward Aggregation
Reverse Parametrization
Encoder-Decoder Designs
Encoder Design
Decoder Design
Training and Inference
Training
Inference
Related Work
Diffusion-based Models
...and 28 more sections

Key Result

Theorem 1

The negative ELBO of MT-Diffusion decomposes as $\mathcal{L} = \mathbb{E}_{q}\left[\mathcal{L}_0 + \mathcal{L}_1 + \mathcal{L}_2 + \mathcal{L}_3\right]$, where

Figures (14)

Figure 1: Illustration of the proposed MT-Diffusion on two modalities. The diffusion process is defined in a shared diffusion space for all modality data, which are mapped into that space by modality-specific encoders. The forward noising process includes a forward aggregation step that integrates information from multi-modal data, and the reverse denoising component transforms the diffusion space back to the task-specific data spaces with learnable decoders through a multi-task loss.
Figure 2: The forward (left) and reverse (right) processes of the proposed MT-Diffusion by jointly modeling a set of task data.
Figure 3: Training pipeline and encoder-decoder design choices. ①②③ indicate three possible choices for the encoder $E(\cdot)$; gray shaded boxes indicate stop gradients; and black dashed lines mean possible connections to the encoder and decoder. "Aggregate" is implemented through Equation \ref{eq:marginal}.
Figure 4: Random samples on the night2day (top, unconditional generation) and cityscape (bottom, conditional generation) datasets. The 3 pictures in each block of the cityscape dataset (bottom) correspond to the conditional image (source), the ground-truth and the inferred target image, respectively.
Figure 5: Randomly generated examples of MT-Diffusion with masked-image training. First row: image restoration from random masks; images in each block: original, masked and restored images. Second row: image restoration from half masking; each block contains two restored images to illustrate generation variance. Third row: image generation from scratch with complete masks.
...and 9 more figures

Theorems & Definitions (6)

Theorem 1
Remark 1
Theorem 2
Remark 2
Lemma 3
Lemma 4: pml1Book
