MaxFusion: Plug&Play Multi-Modal Generation in Text-to-Image Diffusion Models

Nithin Gopalakrishnan Nair, Jeya Maria Jose Valanarasu, Vishal M Patel

TL;DR

MaxFusion tackles the scalability of diffusion-based T2I models to multiple conditioning modalities without retraining. It introduces a variance-based, training-free fusion mechanism that merges intermediate features from separately trained task models at inference time, guided by variance maps to gauge conditioning strength. The method is plug-in and compatible with off-the-shelf modules like ControlNet and T2I-Adapter, and it scales to more than two modalities including style conditioning. Experiments on synthetic COCO-derived data show improved generation quality and conditioning consistency for contradictory and complementary multimodal inputs, highlighting practical potential for zero-shot multi-modal generation.

Abstract

Large diffusion-based Text-to-Image (T2I) models have shown impressive generative powers for text-to-image generation as well as spatially conditioned image generation. For most applications, we can train the model end-to-end with paired data to obtain photorealistic generation quality. However, to add an additional task, one often needs to retrain the model from scratch using paired data across all modalities to retain good generation performance. In this paper, we tackle this issue and propose a novel strategy to scale a generative model across new tasks with minimal compute. During our experiments, we discovered that the variance maps of intermediate feature maps of diffusion models capture the intensity of conditioning. Utilizing this prior information, we propose MaxFusion, an efficient strategy to scale up text-to-image generation models to accommodate new modality conditions. Specifically, we combine aligned features of multiple models, hence bringing a compositional effect. Our fusion strategy can be integrated into off-the-shelf models to enhance their generative prowess.
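
The variance-guided fusion idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not state the exact fusion rule, so the per-location "keep the branch with the largest channel variance" selection below is an assumption suggested by the method's name, and the function names `variance_map` and `max_variance_fuse` are hypothetical. Feature maps are assumed to be spatially aligned arrays of shape (C, H, W).

```python
import numpy as np


def variance_map(features: np.ndarray) -> np.ndarray:
    """Variance across channels at each spatial location: (C, H, W) -> (H, W)."""
    return features.var(axis=0)


def max_variance_fuse(feature_list):
    """Assumed fusion rule (not confirmed by the source): at every spatial
    location, keep the feature vector of the branch with the largest
    channel variance.

    Returns the fused (C, H, W) features and the (H, W) map of winning
    branch indices.
    """
    stacked = np.stack(feature_list)                  # (M, C, H, W)
    var_maps = stacked.var(axis=1)                    # (M, H, W)
    winner = var_maps.argmax(axis=0)                  # (H, W)
    # Broadcast the winning index over the channel axis so it can be used
    # to gather along the branch axis.
    idx = np.broadcast_to(winner[None, None], (1,) + stacked.shape[1:])
    fused = np.take_along_axis(stacked, idx, axis=0)[0]
    return fused, winner


# Toy example: a "depth" branch that is active only on the left half of the
# map and an "edge" branch that is active only on the right half.
rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
depth = np.zeros((C, H, W))
depth[:, :, :2] = rng.normal(size=(C, H, 2))
edge = np.zeros((C, H, W))
edge[:, :, 2:] = rng.normal(size=(C, H, 2))

fused, winner = max_variance_fuse([depth, edge])
```

In this toy setup each branch has zero variance wherever its condition is absent, so the fused map takes the depth features on the left half and the edge features on the right half. A real pipeline would likely also need to normalize feature scales across branches and decide how to treat locations where several conditions are active; the abstract does not specify either step.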

Paper Structure

This paper contains 20 sections, 7 equations, 15 figures, 4 tables, 2 algorithms.

Figures (15)

  • Figure 1: Non-cherry-picked examples of multimodal generation with 3 modalities, $\{Depth, HED, Canny\}$. The text prompt is "Background: A living room, Foreground: a dog near a teddy bear".
  • Figure 2: Variance maps of intermediate encoder and decoder features for the text prompt "An astronaut riding a horse" at the $5^{th}$ diffusion timestep.
  • Figure 2: Non-cherry-picked examples of multimodal generation with 3 modalities, $\{Depth, HED, Pose\}$. The text prompt is "Background: A park, Foreground: a dog near a person".
  • Figure 3: Variance maps across channels for intermediate features of ControlNet for different modalities. The variance map has high values where the condition is present and low values where the condition is absent.
  • Figure 3: Non-cherry-picked examples of multimodal generation with 3 modalities, $\{HED, Depth, Pose\}$. The text prompt is "Background: A mountain, Foreground: a teddy bear near a person".
  • ...and 10 more figures