TweedieMix: Improving Multi-Concept Fusion for Diffusion-based Image/Video Generation

Gihyun Kwon, Jong Chul Ye

TL;DR

TweedieMix targets the generation of images and videos that coherently integrate multiple personalized concepts. It composes customized models at inference time in two stages: a content-aware sampling phase that runs up to timestep $t_{\text{con}}$ with a multi-object prompt and a resampling strategy, followed by fusion in Tweedie's denoised space, where fine-tuned concept models are blended region by region. Region-aware masks guide the concept fusion, and a training-free video extension injects residual features in an image-to-video pipeline. Experiments report better CLIP and DINO alignment scores and perceptual quality than strong baselines, along with favorable user-study results, and ablations confirm the contributions of CFG++, resampling, and denoised-space mixing. The approach requires no weight merging or inversion steps and applies to both image and video generation pipelines.

Abstract

Despite significant advancements in customizing text-to-image and video generation models, generating images and videos that effectively integrate multiple personalized concepts remains a challenging task. To address this, we present TweedieMix, a novel method for composing customized diffusion models during the inference phase. By analyzing the properties of reverse diffusion sampling, our approach divides the sampling process into two stages. During the initial steps, we apply a multi-object-aware sampling technique to ensure the inclusion of the desired target objects. In the later steps, we blend the appearances of the custom concepts in the denoised image space using Tweedie's formula. Our results demonstrate that TweedieMix can generate multiple personalized concepts with higher fidelity than existing methods. Moreover, our framework can be effortlessly extended to image-to-video diffusion models, enabling the generation of videos that feature multiple personalized concepts. Results and source code are available on our anonymous project page.
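
For readers unfamiliar with the term, the "denoised image space" can be made concrete with the standard form of Tweedie's formula. The statement below is a sketch that assumes the usual variance-preserving (DDPM-style) forward process $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,{\boldsymbol \epsilon}$. The symbol $\bar{\alpha}_t$ is an assumed notation and may differ from the parameterization used in the paper.

$$
\hat{\mathbf{x}}_0(\mathbf{x}_t) \;=\; \mathbb{E}[\mathbf{x}_0 \mid \mathbf{x}_t] \;\approx\; \frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\;{\boldsymbol \epsilon}_{\theta}(\mathbf{x}_t, t)}{\sqrt{\bar{\alpha}_t}}
$$

Under this assumption, every intermediate noisy sample $\mathbf{x}_t$ has an associated clean-image estimate $\hat{\mathbf{x}}_0$, and it is these estimates, rather than the noisy samples themselves, that are blended in the later sampling steps.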

Paper Structure

This paper contains 15 sections, 6 equations, 18 figures, 2 tables.

Figures (18)

  • Figure 1: Multi-concept Generation Results from TweedieMix. Our model generates high-quality multi-concept results in both the image and video domains. More results can be found in the experiment section.
  • Figure 2: Method Overview. (a) To improve multi-object generation with the text-to-image model, we use content-aware sampling, in which we sample the image with the non-fine-tuned model ${\boldsymbol \epsilon}_{\theta_0}$ and the multi-object-aware text $\mathbf{c}_{mul}$. At the intermediate step $t_{con}$, we extract masks from the images denoised with Tweedie's formula. (b) After $t_{con}$, we apply the custom concepts using region-wise guidance and concept-wise fine-tuned models. We propose region-wise mixing of the different models in Tweedie's denoised space.
  • Figure 3: Resampling Strategy. To improve multi-object sampling in the content-aware sampling stage, we use a resampling strategy. At the initial timestep $T$, we subtract the single-concept samples from the multi-concept samples to strengthen the multi-concept text condition. This step is also computed in the denoised space using Tweedie's formula. The denoised image visualizations show the effectiveness of the proposed resampling.
  • Figure 4: Method for Video Extension. To preserve the context of the reference image, which is generated with our multi-concept sampling strategy, we propose injecting the residual features of the first frame into the features of the other frames.
  • Figure 5: Qualitative Evaluation of Multi-Concept Image Generation. We evaluate the image generation quality of our method in comparison to baseline approaches, using prompts that incorporate each concept displayed on the left. Overall, our method preserves the appearance of the target concepts without missing any concept, whereas the baseline methods fail to preserve the identity of the concepts or to generate the intended action described in the text.
  • ...and 13 more figures
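
The region-wise mixing described in the caption of Figure 2(b) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `tweedie_denoise` and `mix_denoised`, the `alpha_bar_t` noise-schedule value, and the simple mask-weighted blend are all assumptions made for illustration, and the noise predictions are passed in as plain arrays rather than produced by real fine-tuned models. How the mixed estimate is fed back into the sampler update is not covered here.

```python
import numpy as np


def tweedie_denoise(x_t, eps, alpha_bar_t):
    """Clean-image estimate from a noisy sample and a noise prediction.

    Assumes the variance-preserving forward process
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
    """
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)


def mix_denoised(x_t, eps_base, eps_concepts, masks, alpha_bar_t):
    """Blend per-concept denoised estimates region by region (hypothetical rule).

    x_t          : noisy sample, shape (C, H, W)
    eps_base     : noise prediction used outside all concept regions
    eps_concepts : list of noise predictions, one per concept model
    masks        : list of (H, W) arrays in [0, 1], one per concept,
                   assumed here to be non-overlapping
    """
    x0_mixed = tweedie_denoise(x_t, eps_base, alpha_bar_t)
    for eps_k, mask_k in zip(eps_concepts, masks):
        x0_k = tweedie_denoise(x_t, eps_k, alpha_bar_t)
        # Inside the mask, take the concept model's estimate; elsewhere keep the current one.
        x0_mixed = mask_k * x0_k + (1.0 - mask_k) * x0_mixed
    return x0_mixed


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_t = rng.standard_normal((3, 8, 8))
    # Two toy concept regions: left half and right half of the image.
    left = np.zeros((8, 8))
    left[:, :4] = 1.0
    right = 1.0 - left
    # Stand-ins for the outputs of the base and concept-specific models.
    eps_base = rng.standard_normal(x_t.shape)
    eps_a = rng.standard_normal(x_t.shape)
    eps_b = rng.standard_normal(x_t.shape)
    x0 = mix_denoised(x_t, eps_base, [eps_a, eps_b], [left, right], alpha_bar_t=0.5)
    print(x0.shape)
```

Blending the clean-image estimates instead of the raw noise predictions mirrors the caption's emphasis on "Tweedie's denoised space". In practice the masks would come from the denoised image at $t_{con}$, as the caption states, rather than being fixed halves.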