MM-Diff: High-Fidelity Image Personalization via Multi-Modal Condition Integration

Zhichao Wei, Qingkun Su, Long Qin, Weizhi Wang

TL;DR

MM-Diff tackles the challenge of fast, high-fidelity personalized image generation without fine-tuning. It harmonizes vision-derived embeddings with text conditioning via a multi-modal cross-attention mechanism enhanced by LoRA, and introduces a Subject Embedding Refiner to inject detailed subject information efficiently. To address multi-subject composition, it employs training-time cross-attention map constraints that align entity tokens to distinct image regions, enabling layout-free multi-subject sampling at inference. Extensive experiments across general, portrait, and multi-subject settings demonstrate superior subject fidelity and competitive text fidelity relative to both tuning-based and tuning-free baselines, with robust ablations validating each component.
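To make the idea of a cross-attention map constraint concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's loss: the function name `region_attention_loss`, the use of binary per-subject masks as training-time supervision, and the "fraction of attention outside the region" penalty are all hypothetical choices made for this example. The summary only states that entity tokens are aligned to distinct image regions during training.

```python
def region_attention_loss(attn_maps, masks):
    """Hypothetical constraint: penalize attention that leaks outside a region.

    attn_maps[i][p]: non-negative attention of entity token i at image position p.
    masks[i][p]:     1 if position p belongs to subject i, else 0 (assumed supervision).
    Returns the mean fraction of each token's attention mass outside its region.
    """
    losses = []
    for attn, mask in zip(attn_maps, masks):
        total = sum(attn)
        inside = sum(a * m for a, m in zip(attn, mask))
        # A token with no attention mass contributes no penalty.
        losses.append(1.0 - inside / total if total > 0 else 0.0)
    return sum(losses) / len(losses)


# Two entity tokens over a flattened 4-position image.
# Subject 0 occupies the first two positions, subject 1 the last two.
masks = [[1, 1, 0, 0], [0, 0, 1, 1]]
aligned = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
leaking = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]

loss_aligned = region_attention_loss(aligned, masks)
loss_leaking = region_attention_loss(leaking, masks)
```

In this toy setup the aligned maps incur no penalty, while the uniform maps lose half of their attention mass to the other subject's region. A penalty of this kind is needed only during training, which is consistent with the claim that no layout is required at inference.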

Abstract

Recent advances in tuning-free personalized image generation based on diffusion models are impressive. However, to improve subject fidelity, existing methods either retrain the diffusion model or infuse it with dense visual embeddings, both of which suffer from poor generalization and efficiency. These methods also falter in multi-subject image generation because the cross-attention mechanism is unconstrained. In this paper, we propose MM-Diff, a unified and tuning-free image personalization framework capable of generating high-fidelity images of both single and multiple subjects in seconds. Specifically, to enhance text consistency and subject fidelity simultaneously, MM-Diff employs a vision encoder to transform the input image into CLS and patch embeddings. The CLS embeddings serve two purposes: they augment the text embeddings, and, together with the patch embeddings, they are used to derive a small number of detail-rich subject embeddings. Both the augmented text embeddings and the subject embeddings are efficiently integrated into the diffusion model through a multi-modal cross-attention mechanism. Additionally, MM-Diff introduces cross-attention map constraints during the training phase, enabling flexible multi-subject image sampling during inference without any predefined inputs (e.g., layout). Extensive experiments demonstrate the superior performance of MM-Diff over other leading methods.
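The two condition streams described in the abstract can be pictured with a small plain-Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the augmentation is modeled as adding the CLS vector to a single entity token (`augment_text`), the text and subject streams are attended separately and summed with a hypothetical weight `subject_scale`, and the learned query/key/value projections and LoRA layers are omitted entirely. The actual MM-Diff design may differ on each of these points.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def augment_text(text_emb, cls_emb, entity_index):
    """Assumed augmentation: add the CLS embedding to the entity token only."""
    augmented = [list(tok) for tok in text_emb]
    augmented[entity_index] = [t + c for t, c in zip(text_emb[entity_index], cls_emb)]
    return augmented


def multimodal_cross_attention(latents, text_emb, subject_emb, subject_scale=1.0):
    """Assumed fusion: attend to each stream separately, then take a weighted sum."""
    text_out, _ = attend(latents, text_emb, text_emb)
    subj_out, _ = attend(latents, subject_emb, subject_emb)
    return [[t + subject_scale * s for t, s in zip(t_row, s_row)]
            for t_row, s_row in zip(text_out, subj_out)]


# Toy inputs: 2 latent positions, 2 text tokens, 2 subject embeddings, width 2.
latents = [[1.0, 0.0], [0.0, 1.0]]
text_emb = [[0.5, 0.5], [1.0, 0.0]]      # token 1 plays the role of the entity word
cls_emb = [0.0, 1.0]
subject_emb = [[1.0, 1.0], [0.0, 2.0]]   # stand-in for the few detail-rich embeddings

aug = augment_text(text_emb, cls_emb, entity_index=1)
out = multimodal_cross_attention(latents, aug, subject_emb, subject_scale=0.5)
```

Setting `subject_scale=0.0` reduces the sketch to ordinary text-only cross-attention, which makes it easy to see what the subject stream contributes on top of the augmented text conditioning.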

Paper Structure (40 sections, 6 equations, 11 figures, 6 tables)

This paper contains 40 sections, 6 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Given a single reference image, our MM-Diff can generate diverse personalized images guided by the text prompt in seconds. Moreover, our model supports multi-subject image generation without any predefined inputs (e.g., layouts).
  • Figure 2: The overall pipeline of the proposed MM-Diff. On the left, the vision-augmented text embeddings and a small set of detail-rich subject embeddings are injected into the diffusion model through the well-designed multi-modal cross-attention. On the right, we illustrate the details of the innovative implementation of cross-attention with LoRAs, as well as the attention constraints that facilitate multi-subject generation.
  • Figure 3: Visual comparisons on single subject generation.
  • Figure 4: Visual comparisons on portrait generation.
  • Figure 5: Visual comparisons on multi-subject generation.
  • ...and 6 more figures