SPDiffusion: Semantic Protection Diffusion Models for Multi-concept Text-to-image Generation

Yang Zhang, Rui Zhang, Xuecheng Nie, Haochen Li, Jikun Chen, Yifan Hao, Xin Zhang, Luoqi Liu, Ling Li

TL;DR

SPDiffusion tackles semantic entanglement in multi-concept text-to-image generation by introducing SP-Extraction to locate concept regions and SP-Attn to shield these regions from irrelevant tokens, using only text prompts. The approach is training-free and focuses attention protection during the early denoising steps to minimize overhead. It delivers state-of-the-art performance on CC-500, Wearing-100, and Animals-100 against multiple baselines, with BLIP-VQA and InternVL-VQA scores confirming improved semantic consistency. By avoiding layout inputs and maintaining low additional cost, SPDiffusion offers a practical solution for reliable multi-concept image synthesis with broad applicability to illustration, storytelling, and visual design.

Abstract

Recent text-to-image models have achieved impressive results in generating high-quality images. However, when tasked with multi-concept generation, that is, creating images that contain multiple characters or objects, existing methods often suffer from semantic entanglement, including concept entanglement and improper attribute binding, which leads to significant text-image inconsistency. We identify that semantic entanglement arises when certain regions of the latent features attend to incorrect concept and attribute tokens. In this work, we propose the Semantic Protection Diffusion Model (SPDiffusion) to address both concept entanglement and improper attribute binding using only a text prompt as input. The SPDiffusion framework introduces a novel concept region extraction method, SP-Extraction, to resolve region entanglement in cross-attention, along with SP-Attn, which protects concept regions from the influence of irrelevant attributes and concepts. We evaluate our method on existing benchmarks, where SPDiffusion achieves state-of-the-art results, demonstrating its effectiveness.
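The idea of protecting a concept region from irrelevant tokens can be illustrated with a generic masked cross-attention. The following is a minimal sketch under stated assumptions, not the paper's SP-Attn formulation, which is not given in this excerpt: the function name `protected_cross_attention`, the boolean `protect_mask` layout, and the use of a plain softmax mask are all hypothetical choices made for illustration.

```python
import numpy as np

def protected_cross_attention(q, k, v, protect_mask):
    """Cross-attention in which selected (pixel, token) pairs are blocked.

    q: (n_pixels, d) latent queries
    k: (n_tokens, d) text-token keys
    v: (n_tokens, d_v) text-token values
    protect_mask: (n_pixels, n_tokens) bool, True where a pixel must NOT
        attend to a token (hypothetical layout, assumed for this sketch).
    """
    if protect_mask.all(axis=-1).any():
        # A fully masked row would make the softmax undefined.
        raise ValueError("each pixel must keep at least one visible token")

    # Scaled dot-product scores between every pixel and every token.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Blocked pairs get -inf so they receive exactly zero attention weight.
    scores = np.where(protect_mask, -np.inf, scores)

    # Numerically stable softmax over the token axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

In this sketch, a pixel inside the bear region would have `protect_mask` set to True for the mouse token and its attributes, so those tokens contribute nothing to that pixel's output while all other tokens are renormalized.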

Paper Structure (28 sections, 14 equations, 11 figures, 1 table)

This paper contains 28 sections, 14 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: Semantic Entanglement. Existing diffusion models often suffer from semantic entanglement in multi-concept text-to-image generation, which comprises two sub-problems: (a) Concept Entanglement. Features of one concept transfer to another concept (e.g., the bear exhibits mouse-like ears and mouth). (b) Improper Attribute Binding. An attribute of one concept binds to another concept (e.g., the red color binds to the suitcase and the gold color binds to the clock).
  • Figure 2: Semantic Entanglement Visualization. (a) Cross-attention map visualization shows both the mouse and bear regions merging mouse features, causing the bear's ear region to query image features (highlighted in the red box) associated with the mouse, resulting in a mouse-like ear. (b) When the bear region does not merge mouse features, it does not query the mouse ear feature in the red box, maintaining distinct bear features.
  • Figure 3: Concept Region Extraction. We visualize the normalized heat maps of both the cross-attention map of the mouse token and the self-attention maps of anchor points. Additionally, we display the masks and points within the blue and orange boxes under varying thresholds.
  • Figure 4: Overview of SPDiffusion
  • Figure 5: Qualitative comparison. Our method addresses semantic entanglement problems on all three datasets.
  • ...and 6 more figures
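
The Figure 3 caption mentions normalized attention heat maps and masks obtained under varying thresholds. As a rough illustration of that single step, the sketch below min-max normalizes an attention map and thresholds it into a binary mask. It is an assumption-laden example and not the paper's SP-Extraction procedure, which is not described in this excerpt; the name `attention_to_mask` and the choice of min-max normalization are hypothetical.

```python
import numpy as np

def attention_to_mask(attn_map, threshold):
    """Turn a 2D attention map into a boolean region mask.

    attn_map: (H, W) array of non-normalized attention values
    threshold: value in [0, 1] applied after min-max normalization
    """
    lo, hi = attn_map.min(), attn_map.max()
    if hi == lo:
        # A constant map carries no spatial information: return an empty mask.
        return np.zeros(attn_map.shape, dtype=bool)
    # Min-max normalization maps the values into [0, 1].
    normalized = (attn_map - lo) / (hi - lo)
    # Pixels at or above the threshold are treated as part of the region.
    return normalized >= threshold

# Example: a higher threshold yields a smaller, more selective mask.
attn = np.array([[0.0, 0.2],
                 [0.6, 1.0]])
loose_mask = attention_to_mask(attn, 0.5)
strict_mask = attention_to_mask(attn, 0.9)
```

Raising the threshold shrinks the mask toward the most strongly attended pixels, which is the trade-off the varying-threshold panels in such a figure would typically expose.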