LOCATEdit: Graph Laplacian Optimized Cross Attention for Localized Text-Guided Image Editing

Achint Soni, Meet Soni, Sirisha Rambhatla

TL;DR

LOCATEdit addresses imprecise cross-attention masks in diffusion-based image editing by refining them on a CASA (Cross and Self-Attention) graph with a graph Laplacian regularizer that enforces spatial coherence. It builds the CASA graph from cross- and self-attention maps and solves the regularized problem in closed form, $\mathbf{m}^* = (\mathbf{\Lambda} + \lambda \mathbf{L})^{-1} \mathbf{\Lambda} \mathbf{m}_0$, to produce a refined, localized editing mask. The method combines this refinement with selective embedding interpolation via an IP-Adapter in a dual-branch, training-free editing framework, keeping edits confined to target regions while preserving background structure. Experiments on PIE-Bench show state-of-the-art structure fidelity and CLIP-based alignment, enabling reliable localized text-guided edits in diverse scenes.
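
The closed-form refinement can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `refine_mask`, the chain-graph affinity `W`, the uniform `confidence` weights, and the 0.5 threshold are all assumptions made for illustration, and $\mathbf{L}$ is taken to be the combinatorial Laplacian $\mathbf{D} - \mathbf{W}$ of a symmetric, non-negative affinity matrix.

```python
import numpy as np

def refine_mask(m0, W, confidence, lam=1.0):
    """Solve (Lambda + lam * L) m = Lambda m0 for the refined mask m.

    m0         : initial (flattened) attention values, shape (n,)
    W          : patch affinity matrix, shape (n, n); assumed non-negative
    confidence : positive diagonal entries of Lambda, shape (n,)
    lam        : weight of the Laplacian smoothness term
    """
    W = 0.5 * (W + W.T)               # symmetrize the affinity (assumption)
    L = np.diag(W.sum(axis=1)) - W    # combinatorial Laplacian L = D - W
    Lam = np.diag(confidence)         # diagonal fidelity weights
    # Solve the linear system instead of forming the inverse explicitly.
    return np.linalg.solve(Lam + lam * L, Lam @ m0)

# Toy example: 6 patches on a chain, standing in for self-attention affinities.
n = 6
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0

m0 = np.array([0.9, 0.1, 0.8, 0.1, 0.0, 0.2])   # noisy initial attention
confidence = np.ones(n)                          # Lambda = identity here

m_star = refine_mask(m0, W, confidence, lam=2.0)
mask = m_star > 0.5                              # hypothetical threshold
```

Solving the system with `np.linalg.solve` avoids an explicit matrix inverse. At real attention resolutions the system is large, so a sparse or iterative solver would likely be preferable.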

Abstract

Text-guided image editing aims to modify specific regions of an image according to natural language instructions while preserving the overall structure and background fidelity. Existing methods use masks derived from cross-attention maps generated by diffusion models to identify the target regions for modification. However, because cross-attention mechanisms focus on semantic relevance, they struggle to maintain image integrity. As a result, these methods often lack spatial consistency, leading to editing artifacts and distortions. In this work, we address these limitations and introduce LOCATEdit, which enhances cross-attention maps through a graph-based approach that uses self-attention-derived patch relationships to maintain smooth, coherent attention across image regions, ensuring that alterations are limited to the designated items while the surrounding structure is retained. LOCATEdit consistently and substantially outperforms existing baselines on PIE-Bench, demonstrating state-of-the-art performance and effectiveness on various editing tasks. Code is available at https://github.com/LOCATEdit/LOCATEdit/

Paper Structure

This paper contains 24 sections, 2 theorems, 30 equations, 6 figures, 4 tables.

Key Result

Lemma 1

The graph Laplacian $\mathbf{L} \in \mathbb{R}^{R^2\times R^2}$ is positive semidefinite.
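
A short sketch of why this holds, assuming the standard construction $\mathbf{L} = \mathbf{D} - \mathbf{W}$ with a symmetric, non-negative affinity matrix $\mathbf{W}$ and degree matrix $D_{ii} = \sum_j W_{ij}$ (this summary does not restate the paper's exact definition, so these symbols are assumptions): for any $\mathbf{x} \in \mathbb{R}^{R^2}$,

$$
\mathbf{x}^\top \mathbf{L}\,\mathbf{x}
= \sum_{i} D_{ii}\, x_i^2 - \sum_{i,j} W_{ij}\, x_i x_j
= \frac{1}{2} \sum_{i,j} W_{ij}\,(x_i - x_j)^2 \;\ge\; 0 .
$$

Every term on the right is a non-negative weight times a square, so the quadratic form is never negative, which is the definition of positive semidefiniteness.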

Figures (6)

  • Figure 1: Our LOCATEdit demonstrates strong performance on various complex image editing tasks.
  • Figure 2: Example of over-editing caused by imprecise masks.
  • Figure 3: Overview of our text-guided image editing pipeline. LOCATEdit refines cross-attention maps with graph Laplacian regularization for spatial consistency, uses an IP-Adapter for additional guidance, and employs selective pruning on text embeddings to suppress noise, ensuring the edited image preserves key structural details.
  • Figure 4: CASA (Cross and Self-Attention) Graph Construction workflow. The initial cross-attention maps are upsampled to form a patch-level adjacency graph, then Laplacian regularization enforces spatial consistency. Thresholding the refined maps yields final, more robust attention masks.
  • Figure 5: Illustration of the convex objective $J(\mathbf{m})$ in a 2D slice of the higher-dimensional space. The single global minimum, marked in red, highlights the function’s convex nature.
  • ...and 1 more figures
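
For reference, one objective consistent with the closed-form solution quoted in the TL;DR is sketched below. It is an assumed form, not necessarily the paper's exact formulation: $\mathbf{\Lambda}$ is taken to be a diagonal matrix with positive entries and $\lambda > 0$.

$$
J(\mathbf{m}) = (\mathbf{m} - \mathbf{m}_0)^\top \mathbf{\Lambda}\, (\mathbf{m} - \mathbf{m}_0) + \lambda\, \mathbf{m}^\top \mathbf{L}\, \mathbf{m}
$$

$$
\nabla J(\mathbf{m}) = 2\,\mathbf{\Lambda}(\mathbf{m} - \mathbf{m}_0) + 2\lambda\, \mathbf{L}\, \mathbf{m} = \mathbf{0}
\quad\Longrightarrow\quad
\mathbf{m}^* = (\mathbf{\Lambda} + \lambda \mathbf{L})^{-1} \mathbf{\Lambda}\, \mathbf{m}_0
$$

Under these assumptions the Hessian $2(\mathbf{\Lambda} + \lambda \mathbf{L})$ is positive definite, because $\mathbf{\Lambda}$ is positive definite and $\mathbf{L}$ is positive semidefinite by Lemma 1. $J$ is therefore strictly convex with a single global minimum, matching the picture described for Figure 5.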

Theorems & Definitions (3)

  • Lemma 1
  • Theorem 1
  • proof