DiffX: Guide Your Layout to Cross-Modal Generative Modeling

Zeyu Wang, Jingyu Lin, Yifei Qian, Yi Huang, Shicen Tian, Bosong Chai, Juncan Deng, Qu Yang, Lan Du, Cunjian Chen, Kejie Huang

TL;DR

This paper introduces DiffX, a diffusion model for general layout-guided cross-modal generation. According to the authors, it is the first model for layout-guided cross-modal image generation. It also introduces the Joint-Modality Embedder (JME), which uses a gated attention mechanism to enhance the interaction between layout and text conditions.

Abstract

Diffusion models have made significant strides in language-driven and layout-driven image generation. However, most diffusion models are limited to visible RGB image generation. In fact, human perception of the world is enriched by diverse viewpoints, such as chromatic contrast, thermal illumination, and depth information. In this paper, we introduce a novel diffusion model for general layout-guided cross-modal generation, called DiffX. Notably, our DiffX presents a compact and effective cross-modal generative modeling pipeline, which conducts diffusion and denoising processes in the modality-shared latent space. Moreover, we introduce the Joint-Modality Embedder (JME) to enhance the interaction between layout and text conditions by incorporating a gated attention mechanism. To facilitate the user-instructed training, we construct the cross-modal image datasets with detailed text captions by the Large-Multimodal Model (LMM) and our human-in-the-loop refinement. Through extensive experiments, our DiffX demonstrates robustness in cross-modal "RGB+X" image generation on FLIR, MFNet, and COME15K datasets, guided by various layout conditions. Meanwhile, it shows the strong potential for the adaptive generation of "RGB+X+Y(+Z)" images or more diverse modalities on FLIR, MFNet, COME15K, and MCXFace datasets. To our knowledge, DiffX is the first model for layout-guided cross-modal image generation. Our code and constructed cross-modal image datasets are available at https://github.com/zeyuwang-zju/DiffX.
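
The abstract names a gated attention mechanism inside the JME but does not spell out its form. As a rough illustration only, the sketch below shows one common pattern: a residual attention update scaled by tanh of a scalar gate. Everything in it is an assumption rather than the authors' implementation: the function names (`attend`, `gated_attention`), the choice of text tokens as queries and layout tokens as keys and values, the tanh gate, and the omission of learned projection matrices.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over plain lists of vectors.

    Learned query/key/value projections are omitted to keep the sketch short.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    output = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        output.append([sum(w * v[d] for w, v in zip(weights, values))
                       for d in range(dim)])
    return output


def gated_attention(text_tokens, layout_tokens, gate):
    """Hypothetical gated update: text + tanh(gate) * Attention(text -> layout).

    `gate` stands in for a learnable scalar; this gating form is assumed,
    not taken from the paper.
    """
    attended = attend(text_tokens, layout_tokens, layout_tokens)
    g = math.tanh(gate)
    return [[t + g * a for t, a in zip(tok, att)]
            for tok, att in zip(text_tokens, attended)]
```

With the gate at zero the text tokens pass through unchanged, so a model using this pattern could start from its text-only behaviour and learn how strongly to mix in layout information. Whether DiffX initializes or parameterizes its gate this way is not stated in the abstract.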

Paper Structure

This paper contains 32 sections, 13 equations, 13 figures, 4 tables, 1 algorithm.

Figures (13)

  • Figure 1: Illustration of our cross-modal generative pipeline in modality-shared latent space. Here, the RGB+X modal encoding is employed for illustration, while the framework is capable of supporting additional modal inputs and outputs, such as RGB+X+Y and RGB+X+Y+Z.
  • Figure 2: Illustration of the layout-guided cross-modal generation. The shown "RGB+X+(Y)" images are generated by our proposed DiffX model. In addition, our DiffX can also generate "RGB+X+Y+Z" images or more modalities. The X and Y images shown here can be Thermal (T) images or Depth (D) images.
  • Figure 3: (a) The process of the human-in-the-loop method to construct the image captions. (b) Examples of cross-modal images, labels, and prepared captions. Within the layout types, Seg. denotes Semantic Segmentation map, and SOD denotes Salient Object Detection map.
  • Figure 4: The workflow of our DiffX model for cross-modal generation. It performs the diffusion and denoising processes in the modality-shared space. Finally, the denoised feature $\boldsymbol{z}_0$ is decoded by the multi-path decoders into the cross-modal images in specific distributions.
  • Figure 5: Workflow of our Multi-Path Variational AutoEncoder (MP-VAE). Here, the RGB+X modal encoding is employed for illustration, while it also supports additional modal encoding, such as RGB+X+Y and RGB+X+Y+Z.
  • ...and 8 more figures