MMGen: Unified Multi-modal Image Generation and Understanding in One Go

Jiepeng Wang, Zhaoqing Wang, Hao Pan, Yuan Liu, Dongdong Yu, Changhu Wang, Wenping Wang

TL;DR

MMGen introduces a unified diffusion framework that performs multi-modal generation and understanding within a single diffusion process. It uses a multi-modal (MM) encoding/decoding pipeline with a diffusion transformer and a modality-decoupling strategy, enabling category-conditioned generation, conditioned generation, and cross-modal understanding across RGB, depth, normal, and segmentation. Training employs a velocity loss with modality dropout and a representation-alignment regularization based on DINOv2 features, achieving competitive results with improved efficiency relative to modality-specific baselines and enabling applications such as image-to-image translation and 3D reconstruction. While promising, the approach relies on pseudo-labels and limited data, suggesting future work in scaling data, refining supervision, and expanding modality coverage.

Abstract

A unified diffusion framework for multi-modal generation and understanding has the transformative potential to achieve seamless and controllable image diffusion and other cross-modal tasks. In this paper, we introduce MMGen, a unified framework that integrates multiple generative tasks into a single diffusion model. This includes: (1) multi-modal category-conditioned generation, where multi-modal outputs are generated simultaneously through a single inference process, given category information; (2) multi-modal visual understanding, which accurately predicts depth, surface normals, and segmentation maps from RGB images; and (3) multi-modal conditioned generation, which produces corresponding RGB images based on specific modality conditions and other aligned modalities. Our approach develops a novel diffusion transformer that flexibly supports multi-modal output, along with a simple modality-decoupling strategy to unify various tasks. Extensive experiments and applications demonstrate the effectiveness and superiority of MMGen across diverse tasks and conditions, highlighting its potential for applications that require simultaneous generation and understanding.

Paper Structure

This paper contains 37 sections, 9 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: Optional network architecture design. Note that the orange boxes on the left side of the MM Diffusion block represent the input tokens for transformer diffusion.
  • Figure 2: Method overview. (1) MM Encoding: Given paired multi-modal images, we first use a shared pretrained VAE encoder to encode each modality into latent patch codes. (2) MM Diffusion: Patch codes corresponding to the same image location are grouped to form the multi-modal patch input $x^0$, which is blended with random noise to create the diffusion input $x_t$. Conditioned on timestep $t$, category label $y$, and task embedding $e_t$, the MM Diffusion model iteratively predicts the velocity, resulting in denoised multi-modal patches $x^0_d$. (3) MM Decoding: Finally, these patches are reprojected to the original image locations for each modality and decoded back into image pixels using a shared pretrained VAE decoder.
  • Figure 2: Visualization of errors in normal pseudo labels by StableNormal (Ye et al., 2024). StableNormal struggles to produce accurate estimations in background regions and exhibits variations in reflective areas, such as bird eyes.
  • Figure 3: Diversity of depth-conditioned generation. Given the same depth condition, MMGen can generate diverse RGB images and other aligned modalities.
  • Figure 3: Multi-modal category-conditioned generation.
  • ...and 9 more figures
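
The patch grouping and noise blending described in the method-overview caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (`group_patches`, `blend_with_noise`, `ungroup_patches`) are hypothetical, grouping by channel-wise concatenation is an assumption, and the linear interpolant with velocity target `eps - x0` is one common convention, assumed here because the caption states only that $x^0$ is "blended with random noise" and that the model predicts velocity. The VAE encoder/decoder and the diffusion transformer are omitted.

```python
import numpy as np

def group_patches(latents):
    """Group per-modality latent patch codes by image location.

    latents: dict mapping modality name -> array of shape (N, D), where N is
             the number of patch locations and D the latent size per patch.
    Returns (x0, names): x0 has shape (N, M * D) for M modalities, so row i
    holds the codes of every modality at patch location i.
    Assumption: grouping is done by concatenation along the channel axis.
    """
    names = sorted(latents)
    x0 = np.concatenate([latents[n] for n in names], axis=-1)
    return x0, names

def blend_with_noise(x0, t, rng):
    """Blend clean patches with Gaussian noise at time t in [0, 1].

    Assumed linear interpolant: x_t = (1 - t) * x0 + t * eps,
    whose time derivative (the velocity target) is eps - x0.
    """
    eps = rng.standard_normal(x0.shape)
    x_t = (1.0 - t) * x0 + t * eps
    v_target = eps - x0
    return x_t, v_target

def ungroup_patches(x, names, dim):
    """Split grouped patches back into one (N, dim) array per modality."""
    return {n: x[:, i * dim:(i + 1) * dim] for i, n in enumerate(names)}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in latents: 16 patch locations, 4 latent channels per modality.
    latents = {m: rng.standard_normal((16, 4))
               for m in ["rgb", "depth", "normal", "seg"]}
    x0, names = group_patches(latents)
    x_t, v_target = blend_with_noise(x0, t=0.3, rng=rng)
    print(x0.shape, x_t.shape, v_target.shape)
```

Under this assumed interpolant, a perfect velocity prediction recovers the clean patches as `x_t - t * v_target`, after which `ungroup_patches` returns each modality's codes to its own array for decoding.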