A Training-Free Approach for Multi-ID Customization via Attention Adjustment and Spatial Control

Jiawei Lin, Guanlong Jiao, Jianjin Xu

TL;DR

MultiID tackles two major challenges in multi-ID customization: copy-paste artifacts and limited text controllability. It proposes a training-free framework that reuses a pre-trained single-ID diffusion model through ID-decoupled cross-attention, combined with depth-guided spatial control and extended self-attention, to handle multiple identities and follow text prompts. A new benchmark, IDBench, evaluates local prompt alignment, global prompt adherence, and ID consistency; on it, MultiID performs comparably to or better than training-based methods while avoiding training overhead. The approach offers a modular solution for multi-ID customization, with potential extensions to broader conditional generation tasks.

Abstract

Multi-ID customization is an active topic in computer vision that has attracted considerable attention recently. Given ID images of multiple individuals, the goal is to generate a customized image that seamlessly integrates them while preserving their respective identities. Compared to single-ID customization, multi-ID customization is much more difficult and poses two major challenges. First, because a multi-ID customization model is trained to reconstruct an image from cropped person regions, it often exhibits the copy-paste issue during inference, which lowers quality. Second, the model suffers from inferior text controllability: the generated result simply combines multiple persons into one image, regardless of whether it aligns with the input text. In this work, we propose MultiID to tackle this task in a training-free manner. Since existing single-ID customization models suffer less from the copy-paste issue, our key idea is to adapt these models for multi-ID customization. To this end, we present an ID-decoupled cross-attention mechanism that injects distinct ID embeddings into the corresponding image regions, thereby generating multi-ID outputs. To enhance controllability, we introduce three strategies, namely the local prompt, depth-guided spatial control, and extended self-attention, which make the results more consistent with the text prompts and ID images. We also build a benchmark, called IDBench, for evaluation. Extensive qualitative and quantitative results demonstrate the effectiveness of MultiID in addressing these two challenges. Its performance is comparable to or even better than that of training-based multi-ID customization methods.
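
The idea of injecting distinct ID embeddings into their corresponding image regions can be illustrated with a small sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the function names (`attend`, `id_decoupled_cross_attention`), the use of binary per-ID region masks, and the choice to sum the masked per-ID attention outputs are hypothetical and may differ from what MultiID actually does. The sketch uses plain Python lists rather than a tensor library to stay self-contained.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over a set of tokens."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def id_decoupled_cross_attention(queries, id_tokens, region_masks):
    """Hypothetical sketch: each image location attends only to the ID
    embeddings whose region mask covers it.

    queries      : list of N query vectors (one per image location)
    id_tokens    : list of (keys, values) pairs, one pair per identity
    region_masks : list of N-length 0/1 lists, one per identity (assumed given)
    """
    dim = len(id_tokens[0][1][0])
    output = [[0.0] * dim for _ in queries]
    for (keys, values), mask in zip(id_tokens, region_masks):
        for i, query in enumerate(queries):
            if mask[i]:
                attended = attend(query, keys, values)
                output[i] = [o + a for o, a in zip(output[i], attended)]
    return output


# Toy example: three image locations, two identities with one token each.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
id_a = ([[1.0, 0.0]], [[10.0, 0.0]])  # (keys, values) for identity A
id_b = ([[0.0, 1.0]], [[0.0, 20.0]])  # (keys, values) for identity B
masks = [[1, 0, 0], [0, 1, 0]]        # location 2 belongs to neither identity
result = id_decoupled_cross_attention(queries, [id_a, id_b], masks)
```

In this toy setup, location 0 receives only identity A's features, location 1 only identity B's, and location 2 (outside both masks) receives zeros, so no identity leaks into another's region. In a real diffusion model the masked ID features would typically be added to the text cross-attention output; how the regions are obtained is not stated in the abstract.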

Paper Structure

This paper contains 20 sections, 7 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: MultiID shows better performance than existing methods in terms of generation quality and controllability.
  • Figure 2: Illustration of our proposed MultiID. MultiID takes a global prompt, multiple local prompts and ID images as input, and produces a personalized image accordingly. It consists of three critical components, termed (1) depth-guided spatial control, (2) extended self-attention, and (3) ID-decoupled cross-attention.
  • Figure 3: Qualitative comparison of different methods on IDBench. We highlight the description of human interactions in prompts, emphasizing the comparison of human postures in the generated images.
  • Figure 4: Visualization results of complex interactions.
  • Figure 5: Qualitative analysis of ablation studies. The issues are underlined below the images. "Ill interaction" indicates that the interactions between IDs are unadjusted. "ID incorrect" denotes that the appearance of a personalized ID is inconsistent with its reference. "ID mismatch" denotes confusion between different IDs. The red boxes mark where these issues occur.
  • ...and 4 more figures