EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts

Yucheng Han, Rui Wang, Chi Zhang, Juntao Hu, Pei Cheng, Bin Fu, Hanwang Zhang

TL;DR

EMMA introduces a multi-modal prompting framework for diffusion-based image generation built on the ELLA model. It uses Assemblable Gated Perceiver Resampler (AGPR) blocks to inject non-text modalities via cross-attention while freezing the base model, enabling easy integration with existing diffusion systems. The approach supports composing multiple modalities at inference without additional training and is compatible with Stable Diffusion-based pipelines. Empirical results on common object and portrait datasets demonstrate high fidelity and robust multi-modal control, with visualizations illustrating effective modular fusion and potential for broader applications including personalized storytelling and video generation.

Abstract

Recent advancements in image generation have enabled the creation of high-quality images from text conditions. However, when facing multi-modal conditions, such as text combined with reference appearances, existing methods struggle to balance multiple conditions effectively, typically showing a preference for one modality over others. To address this challenge, we introduce EMMA, a novel image generation model accepting multi-modal prompts built upon the state-of-the-art text-to-image (T2I) diffusion model, ELLA. EMMA seamlessly incorporates additional modalities alongside text to guide image generation through an innovative Multi-modal Feature Connector design, which effectively integrates textual and supplementary modal information using a special attention mechanism. By freezing all parameters in the original T2I diffusion model and only adjusting some additional layers, we reveal an interesting finding that the pre-trained T2I diffusion model can secretly accept multi-modal prompts. This interesting property facilitates easy adaptation to different existing frameworks, making EMMA a flexible and effective tool for producing personalized and context-aware images and even videos. Additionally, we introduce a strategy to assemble learned EMMA modules to produce images conditioned on multiple modalities simultaneously, eliminating the need for additional training with mixed multi-modal prompts. Extensive experiments demonstrate the effectiveness of EMMA in maintaining high fidelity and detail in generated images, showcasing its potential as a robust solution for advanced multi-modal conditional image generation tasks.

Paper Structure (27 sections, 8 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: EMMA can compose multiple multi-modal conditions (top left branch) without further finetuning while still maintaining strong text control over the generated results (bottom branch). EMMA can also be combined with various existing community diffusion models without training.
  • Figure 2: The model architecture of EMMA. (a) The overall framework. (b) The architecture of the Perceiver Resampler block proposed in ELLA [hu2024ella]. (c) The architecture of our Assemblable Gated Perceiver Resampler (AGPR) block. The orange part is what the AGPR block adds to the Perceiver Resampler block. (d) The pipeline of the composition process.
  • Figure 3: Images generated by EMMA with portrait conditions. Two sets of images are generated for two separate stories. The first set is about a woman delivering mail who is chased by a dog. The second set is about a man finding treasure.
  • Figure 4: Visualization of EMMA's generalization ability under different conditions. Each column shares the same text prompt. We show three kinds of conditions. The first row shows results with only the text condition. The second row shows results under multi-modal conditions, such as text plus face and text plus portrait. The bottom row shows results under composite conditions.
  • Figure 5: Visualization of gate values under different conditions. The horizontal axis is the token index and the vertical axis is the layer depth. The gate values are sparse across layers, and models trained under different conditions attend to different tokens, which is the basis of module composition.
  • ...and 3 more figures
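
The captions for Figures 2 and 5 describe gated injection of a non-text condition and the composition of separately trained modules, which rests on different modules gating different tokens. The snippet below is a minimal NumPy sketch of that idea under stated assumptions, not the paper's AGPR implementation: the attention has no learned projections, the gates are hand-set per-token values in [0, 1] rather than learned, and composition is assumed to be a sum of gated updates. The names `cross_attention`, `gated_injection`, and `compose` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    # Single-head scaled dot-product attention with no learned projections
    # (an illustrative simplification).
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores) @ context


def gated_injection(text_tokens, cond_feats, gate):
    # Assumption: one gate value per token scales how much of the
    # condition is injected into that token.
    return gate[:, None] * cross_attention(text_tokens, cond_feats)


def compose(text_tokens, modules):
    # Assumption: modules are combined by summing their gated updates
    # on top of the unchanged text tokens.
    out = text_tokens.copy()
    for cond_feats, gate in modules:
        out = out + gated_injection(text_tokens, cond_feats, gate)
    return out


rng = np.random.default_rng(0)
text_tokens = rng.normal(size=(4, 8))      # 4 tokens, width 8
face_feats = rng.normal(size=(3, 8))       # hypothetical face features
portrait_feats = rng.normal(size=(5, 8))   # hypothetical portrait features

# Sparse gates that touch different tokens, mirroring the Figure 5 observation.
face_gate = np.array([1.0, 0.0, 0.0, 0.0])
portrait_gate = np.array([0.0, 0.0, 1.0, 0.0])

composed = compose(
    text_tokens,
    [(face_feats, face_gate), (portrait_feats, portrait_gate)],
)
```

In this toy setup, tokens whose gates are zero in every module pass through unchanged, so the text condition is preserved wherever no module claims a token. With non-overlapping sparse gates, the two modules do not interfere with each other.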