Improving Multi-Subject Consistency in Open-Domain Image Generation with Isolation and Reposition Attention

Huiguo He, Qiuyue Wang, Yuan Zhou, Yuxuan Cai, Hongyang Chao, Jian Yin, Huan Yang

TL;DR

This work tackles the challenge of multi-subject consistency in open-domain image generation with diffusion models. It identifies two key issues in existing training-free approaches: internal attraction among subjects in self-attention and misalignment between reference and target positions, which degrade consistency when handling multiple subjects. The authors propose IR-Diffusion, introducing Isolation Attention to block cross-subject attraction and Reposition Attention to align reference features with target subject positions. Extensive experiments demonstrate that IR-Diffusion substantially improves consistency metrics while remaining training-free, outperforming prior methods across open-domain benchmarks and backbones. The study provides insight into diffusion-attention mechanics and suggests broader applicability to related generative tasks.

Abstract

Training-free diffusion models have achieved remarkable progress in generating multi-subject consistent images in open-domain scenarios. The key idea of these methods is to incorporate reference subject information into the attention layer. However, existing methods still perform suboptimally when handling many subjects. This paper reveals two primary issues behind this deficiency. First, undesired internal attraction between different subjects within the target image can cause multiple subjects to converge into a single entity. Second, tokens tend to reference nearby tokens, which reduces the effectiveness of the attention mechanism when a subject's position differs significantly between the reference and target images. To address these issues, we propose a training-free diffusion model with Isolation and Reposition Attention, named IR-Diffusion. Specifically, Isolation Attention ensures that multiple subjects in the target image do not reference each other, effectively eliminating subject convergence. Reposition Attention, in turn, scales and repositions subjects in both the reference and target images to the same position within the images. This ensures that subjects in the target image can better reference those in the reference image, thereby maintaining better consistency. Extensive experiments demonstrate that IR-Diffusion significantly enhances multi-subject consistency, outperforming all existing methods in open-domain scenarios.
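As a rough illustration of the Isolation Attention idea described above, the following NumPy sketch builds an attention mask in which tokens of one subject cannot attend to tokens of a different subject. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `isolation_mask` and `masked_attention`, the integer label convention (0 for background, k > 0 for subject k), and the choice to leave background tokens unrestricted are all hypothetical.

```python
import numpy as np

def isolation_mask(labels):
    """Build a boolean [N, N] matrix; allow[q, k] is True if query q may attend to key k.

    labels: 1-D integer array, one entry per token.
            0 marks background, k > 0 marks subject k (assumed convention).
    """
    labels = np.asarray(labels)
    q = labels[:, None]
    k = labels[None, :]
    # Block a pair only when both tokens belong to subjects and the subjects differ.
    blocked = (q > 0) & (k > 0) & (q != k)
    return ~blocked

def masked_attention(Q, K, V, allow):
    """Scaled dot-product attention with disallowed pairs set to -inf before the softmax."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(allow, scores, -np.inf)
    # Every token may attend to itself, so each row keeps at least one finite score.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy layout: one background token, two tokens of subject 1, two tokens of subject 2.
labels = [0, 1, 1, 2, 2]
allow = isolation_mask(labels)
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))
out, weights = masked_attention(Q, K, V, allow)
```

In this toy layout, the attention weights from subject 1 tokens to subject 2 tokens (and vice versa) are exactly zero, while each row of `weights` still sums to one.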

Paper Structure

This paper contains 37 sections, 11 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: Illustration of our idea. (a) Internal attraction among subjects in the target image leads to the convergence of multiple subjects into a single entity. Our Isolation Attention (IA) effectively mitigates this problem, ensuring each subject can be independently generated. (b) Misalignment between subjects in the reference and target images results in the ineffective utilization of reference image features. Our Reposition Attention (RA) aligns the subjects in the reference and target images, thereby enabling the better utilization of the reference image and preserving fine-grained consistency.
  • Figure 2: Illustration of our IR-Diffusion: (a) Isolation Attention (IA): IA isolates internal attraction between different subjects by ensuring that subjects do not receive responses from the Key and Value of other subjects. (b) Reposition Attention (RA): RA repositions the image features of the reference subjects to align with the positions of the corresponding subjects in the target image, enabling the model to more effectively utilize information from the reference image.
  • Figure 3: Comparison of the overall Self-Attention mechanism between DreamStory and our IR-Diffusion. This figure summarizes how the Queries, Keys, and Values are computed for different regions (subjects and background). The differences compared to the baseline (DreamStory) are marked by colored dashed boxes: blue represents IA, and orange represents RA.
  • Figure 4: Average response values between tokens at varying distances in the self-attention layer. As the distance increases, the mean response values generally decrease, suggesting that tokens tend to reference nearby tokens more strongly.
  • Figure 5: Comparison of multi-subject consistent generation between our IR-Diffusion and other SOTA methods. The superior performance of our approach is evident from the more visually appealing and consistent results. Except for StoryDiffusion, which uses the portraits above it for reference, all other methods use the top portraits as a reference. Different subjects are indicated with different colors.
  • ...and 9 more figures
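
The repositioning step described in the caption of Figure 2(b) can be sketched roughly as follows. This is only an illustrative sketch under assumptions that are not taken from the paper: subjects are localized by axis-aligned bounding boxes in `(top, left, bottom, right)` form with exclusive end indices, resizing uses nearest-neighbor sampling, and the function name `reposition` is hypothetical.

```python
import numpy as np

def reposition(ref_feat, ref_box, tgt_box, out_hw):
    """Move a reference subject's features to the target subject's location.

    ref_feat: [H, W, C] reference feature map.
    ref_box:  (top, left, bottom, right) of the subject in the reference (assumed format).
    tgt_box:  (top, left, bottom, right) of the subject in the target (assumed format).
    out_hw:   (H_out, W_out) of the target feature map.
    Returns the repositioned feature map and a boolean mask of the filled region.
    """
    t0, l0, b0, r0 = ref_box
    t1, l1, b1, r1 = tgt_box
    crop = ref_feat[t0:b0, l0:r0]
    h, w = b1 - t1, r1 - l1
    # Nearest-neighbor resize: map each output row/column to a source row/column.
    rows = np.arange(h) * crop.shape[0] // h
    cols = np.arange(w) * crop.shape[1] // w
    resized = crop[rows][:, cols]
    out = np.zeros((*out_hw, ref_feat.shape[-1]), dtype=ref_feat.dtype)
    out[t1:b1, l1:r1] = resized
    valid = np.zeros(out_hw, dtype=bool)
    valid[t1:b1, l1:r1] = True
    return out, valid

# Toy example: a 2x2 subject in the top-left corner is moved to the bottom-right corner.
ref = np.arange(16, dtype=np.float32).reshape(4, 4, 1)
moved, valid = reposition(ref, (0, 0, 2, 2), (2, 2, 4, 4), (4, 4))
```

After this step, the reference subject's features occupy the same spatial location as the target subject, so nearby-token attention (the tendency shown in Figure 4) works in favor of the reference rather than against it.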