Textual Localization: Decomposing Multi-concept Images for Subject-Driven Text-to-Image Generation

Junjie Shentu; Matthew Watson; Noura Al Moubayed

TL;DR

This work addresses the challenge of customizing subject-driven text-to-image models when input images contain multiple concepts, where existing methods struggle to dissociate and learn a single target concept. It introduces Textual Localization, a diffusion-based approach that uses segmentation-guided cross-attention and two guidance modes (hard and soft) to disentangle concepts and bind a dedicated identifier token to the target concept during fine-tuning. The method combines a denoising objective, class-prior preservation, and an attention loss to steer the model’s attention toward the target region, improving image fidelity and image-text alignment on multi-concept data while producing interpretable cross-attention maps. Empirically, hard guidance often yields the strongest concept localization and multi-concept fidelity, while soft guidance offers strong text alignment and preserves broader semantic information. Overall, the approach enables precise, interpretable subject personalization for multi-concept inputs in diffusion-based text-to-image generation, outperforming or matching strong baselines on key metrics.
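As a rough illustration of how the three training terms named above might be combined, the sketch below scores a cross-attention map against a segmentation mask and adds the result to the other two losses. This summary does not give the exact form of the hard and soft guidance, so a plain mean-squared error against a binary mask is used as a stand-in; the function names, the weights `w_prior` and `w_attn`, and the max-normalization step are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def attention_loss(attn_map, mask):
    """Mean-squared error between a normalized attention map and a binary mask.

    Hypothetical stand-in for the paper's attention loss; the real hard and
    soft guidance variants are not specified in this summary.
    """
    attn = np.asarray(attn_map, dtype=np.float64)
    target = np.asarray(mask, dtype=np.float64)
    # Scale the attention map to [0, 1] so it is comparable with the mask.
    peak = attn.max()
    if peak > 0:
        attn = attn / peak
    return float(np.mean((attn - target) ** 2))


def total_loss(denoise, prior, attn, w_prior=1.0, w_attn=1.0):
    """Weighted sum of the three terms; the weights are assumed, not from the paper."""
    return denoise + w_prior * prior + w_attn * attn


# Toy 2x2 example: the target concept occupies the top row of the image.
mask = np.array([[1, 1], [0, 0]])
focused = np.array([[0.9, 0.9], [0.0, 0.0]])  # attention on the target only
diffuse = np.array([[0.5, 0.5], [0.5, 0.5]])  # attention spread over everything

print(attention_loss(focused, mask))
print(attention_loss(diffuse, mask))
```

In this toy case the focused map is penalized less than the diffuse one, which is the behavior an attention loss of this kind is meant to encourage.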

Abstract

Subject-driven text-to-image diffusion models empower users to tailor the model to new concepts absent in the pre-training dataset using a few sample images. However, prevalent subject-driven models primarily rely on single-concept input images and struggle to specify the target concept when given multi-concept input images. To this end, we introduce a textual localized text-to-image model (Textual Localization) to handle multi-concept input images. During fine-tuning, our method incorporates a novel cross-attention guidance to decompose multiple concepts, establishing distinct connections between the visual representation of the target concept and the identifier token in the text prompt. Experimental results reveal that our method outperforms or performs comparably to the baseline models in terms of image fidelity and image-text alignment on multi-concept input images. In comparison to Custom Diffusion, our method with hard guidance achieves CLIP-I scores that are 7.04% and 8.13% higher and CLIP-T scores that are 2.22% and 5.85% higher in single-concept and multi-concept generation, respectively. Notably, our method generates cross-attention maps consistent with the target concept in the generated images, a capability absent in existing models.
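For readers unfamiliar with the two metrics, CLIP-I is commonly computed as the average pairwise cosine similarity between CLIP embeddings of generated and reference images, and CLIP-T as the average cosine similarity between CLIP embeddings of generated images and their text prompts. The sketch below shows that computation on precomputed embedding vectors. It assumes the embeddings have already been extracted by some CLIP encoder; the helper name `mean_pairwise_cosine` and the toy vectors are hypothetical, and the paper's exact evaluation protocol is not stated in this abstract.

```python
import numpy as np


def mean_pairwise_cosine(a, b):
    """Average cosine similarity over all pairs of rows from a and b."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # Normalize each embedding to unit length so the dot product is a cosine.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float((a @ b.T).mean())


# Toy 2-D stand-ins for CLIP embeddings (real ones have hundreds of dimensions).
generated_images = np.array([[1.0, 0.0], [0.0, 1.0]])
reference_images = np.array([[1.0, 0.0]])
prompt_text = np.array([[2.0, 0.0]])

clip_i = mean_pairwise_cosine(generated_images, reference_images)
clip_t = mean_pairwise_cosine(generated_images, prompt_text)
print(clip_i, clip_t)
```

Higher CLIP-I indicates generated images closer to the reference subject, and higher CLIP-T indicates closer agreement with the prompt.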

Paper Structure (21 sections, 9 equations, 11 figures, 8 tables)

Introduction
Related Work
Text-to-image diffusion model
Subject-driven text-to-image generation
Cross-attention in text-to-image generation
Textual Localized Diffusion Model
Preliminaries
Pipeline of Textual Localization
Cross-attention Guidance
Experiments and Results
Experimental setup
Single-concept generation
Multi-concept generation
Probing into cross-attention maps
Ablation study
...and 6 more sections

Figures (11)

Figure 1: Failure cases in single-concept generation by Custom Diffusion when fine-tuning on multi-concept inputs
Figure 2: Illustration of a single step of the fine-tuning process of Textual Localization
Figure 3: Qualitative comparison in single-concept generation
Figure 4: Qualitative comparison in multi-concept generation
Figure 5: Image samples and cross-attention maps of identifier tokens generated by adopted models fine-tuned on multi-concept input images
...and 6 more figures
