LocInv: Localization-aware Inversion for Text-Guided Image Editing

Chuanming Tang; Kai Wang; Fei Yang; Joost van de Weijer

TL;DR

LocInv addresses cross-attention leakage in text-guided image editing with diffusion models by integrating localization priors (segmentation maps or bounding boxes) into a dynamic token update framework. It defines similarity, overlap, and adjective-binding losses and uses progressive thresholds $TH_t = \beta \exp(-t/\alpha)$ along with Null-Text embeddings to align attention maps $\mathcal{A}_t$ with priors while preserving image reconstruction. On COCO-edit with Stable Diffusion, LocInv achieves superior editing quality and background preservation without model fine-tuning, and enables attribute edits through adjective-noun binding. This approach enhances robustness for multi-object scenes and demonstrates practical, localization-aware editing suitable for real-world applications.
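The progressive threshold $TH_t = \beta \exp(-t/\alpha)$ can be illustrated with a short sketch. The values of `ALPHA` and `BETA` below are hypothetical placeholders, not the paper's settings, and using the threshold as a stopping criterion for the per-timestep token update is an assumption made here for illustration; `steps_until_below` is a toy stand-in for that update, not the actual optimization.

```python
import math


def progressive_threshold(t: int, alpha: float, beta: float) -> float:
    """Evaluate TH_t = beta * exp(-t / alpha) for timestep t."""
    return beta * math.exp(-t / alpha)


def steps_until_below(loss: float, threshold: float,
                      decay: float = 0.5, max_steps: int = 10) -> int:
    """Toy stand-in for the token update (assumption): each step multiplies
    the loss by `decay`; stop once the loss is at or below the threshold."""
    steps = 0
    while loss > threshold and steps < max_steps:
        loss *= decay
        steps += 1
    return steps


# Hypothetical hyperparameters, chosen only to make the schedule visible.
ALPHA, BETA = 25.0, 0.9

schedule = {t: progressive_threshold(t, ALPHA, BETA) for t in (0, 10, 25, 50)}
for t, th in schedule.items():
    print(f"t={t:2d}  TH_t={th:.4f}")
```

Under this schedule the threshold equals $\beta$ at $t = 0$ and decays exponentially as $t$ grows, so timesteps with larger $t$ would demand a smaller loss before the update stops.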

Abstract

Large-scale Text-to-Image (T2I) diffusion models demonstrate significant generation capabilities based on textual prompts. Building on T2I diffusion models, text-guided image editing research aims to empower users to manipulate generated images by altering the text prompts. However, existing image editing techniques are prone to editing unintended regions beyond the target area, primarily due to inaccuracies in cross-attention maps. To address this problem, we propose Localization-aware Inversion (LocInv), which exploits segmentation maps or bounding boxes as extra localization priors to refine the cross-attention maps in the denoising phases of the diffusion process. By dynamically updating the tokens corresponding to noun words in the textual input, we compel the cross-attention maps to closely align with the correct noun and adjective words in the text prompt. Based on this technique, we achieve fine-grained image editing over particular objects while preventing undesired changes to other regions. Our method LocInv, based on the publicly available Stable Diffusion, is extensively evaluated on a subset of the COCO dataset, and consistently obtains superior results both quantitatively and qualitatively. The code will be released at https://github.com/wangkai930418/DPL
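To make the idea of aligning a cross-attention map with a localization prior concrete, here is a minimal sketch of two alignment penalties. The exact loss definitions are not given in this summary, so the forms below (one minus cosine similarity, and one minus the fraction of attention mass inside the prior) are assumptions, and the function names are hypothetical.

```python
import math


def _flatten(grid):
    """Flatten a 2D list (H x W) into a list of floats."""
    return [float(v) for row in grid for v in row]


def similarity_loss(attn, prior, eps=1e-8):
    """Assumed form: 1 - cosine similarity between attention map and prior."""
    a, s = _flatten(attn), _flatten(prior)
    dot = sum(x * y for x, y in zip(a, s))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in s))
    return 1.0 - dot / (norm + eps)


def overlap_loss(attn, prior, eps=1e-8):
    """Assumed form: 1 - share of total attention mass that falls inside the prior."""
    a, s = _flatten(attn), _flatten(prior)
    inside = sum(x * y for x, y in zip(a, s))
    return 1.0 - inside / (sum(a) + eps)


# Toy 4x4 prior: a 2x2 object mask in the centre (1 = object, 0 = background).
mask = [[1 if 1 <= r <= 2 and 1 <= c <= 2 else 0 for c in range(4)]
        for r in range(4)]
attn_aligned = [row[:] for row in mask]                # attention exactly on the object
attn_leaky = [[1 - v for v in row] for row in mask]    # attention entirely on background

print(similarity_loss(attn_aligned, mask), overlap_loss(attn_aligned, mask))
print(similarity_loss(attn_leaky, mask), overlap_loss(attn_leaky, mask))
```

In this toy setting both penalties are close to zero when the attention coincides with the mask and equal to one when all attention leaks onto the background, which is the behavior a localization-aware objective would need. A bounding-box prior could be handled the same way by filling the box region with ones.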

Paper Structure (20 sections, 8 equations, 11 figures, 4 tables, 1 algorithm)

Introduction
Related work
Methodology
Preliminary
Dynamic Prompt Learning
LocInv: Localization-aware Inversion
Adjective binding
Experiments
Ablation study
Image editing evaluation
Conclusion
Limitations
Broader Impacts
Dataset statistics
User Study
...and 5 more sections

Figures (11)

Figure 1: Compared with the naive DDIM inversion, our method LocInv aims at enhancing the cross-attention maps by applying localization priors (segmentation maps or detection bounding boxes provided by the datasets or foundation models) to guide the inversion processes. Furthermore, to force strong bindings between adjective and noun words, we constrain the cross-attention similarity between them.
Figure 2: Illustration of our proposed method LocInv. The image $\mathcal{I}$ comes with its localization prior denoted as $S$ (segmentation maps or detection boxes). For each timestep $t$, the noun (and optionally adjective) words in the text prompt are transformed into dynamic tokens, as introduced in the Dynamic Prompt Learning section. In each denoising step ${\bar{z}}_{t-1} \rightarrow {\bar{z}}_t$, we update the dynamic token set ${\mathcal{V}_t}$ with our proposed overlapping loss, similarity loss and adjective binding loss, in order to ensure high-quality cross-attention maps.
Figure 3: Ablation study over hyperparameters given the Segment-Prior (first row) or Detection-Prior (second row). In the first and second columns, we ablate the hyperparameters of the similarity loss and the overlapping loss, respectively. The third column illustrates the influence of the trade-off parameters. Lastly, we show the IoU curves of LocInv together with NTI and DPL as baseline comparisons.
Figure 4: Comparison over the local object Word-Swap editing given the Segment-Prior. All examples are from the COCO-edit dataset. We group the compared methods into those that (1) freeze the SD (Rombach et al., CVPR 2022) models; and (2) fine-tune the SD models or use mask-based inpainting.
Figure 5: Attribute-Edit by swapping the adjectives given the Segment-Prior. By forcing the binding between the cross-attention from the adjective words and corresponding noun words, LocInv successfully edits the color or material attribute.
...and 6 more figures
