
LIME: Localized Image Editing via Attention Regularization in Diffusion Models

Enis Simsar, Alessio Tonioni, Yongqin Xian, Thomas Hofmann, Federico Tombari

TL;DR

LIME introduces a localization-focused editing pipeline for diffusion-based, text-guided image editing that requires neither model re-training nor extra user inputs. It combines multi-resolution, feature-based segmentation with cross-attention guidance to identify an RoI, and adds a novel attention-regularization mechanism that confines edits to that RoI. Built on top of InstructPix2Pix, the method yields consistent qualitative and quantitative gains on benchmarks such as MagicBrush, PIE-Bench, and EditVal, and can be extended to other editing models. It advances the controllability of diffusion models by enabling precise, localized edits that preserve the surrounding content, which makes image editing more efficient and user-friendly in practice. The work also discusses limitations, broader applicability, and responsible use.

Abstract

Diffusion models (DMs) have gained prominence due to their ability to generate high-quality, varied images, with recent advancements in text-to-image generation. The research focus is now shifting towards the controllability of DMs. A significant challenge within this domain is localized editing, where specific areas of an image are modified without affecting the rest of the content. This paper introduces LIME for localized image editing in diffusion models. LIME does not require user-specified regions of interest (RoI) or additional text input; instead, it employs features from pre-trained methods and a straightforward clustering method to obtain a precise editing mask. Then, by leveraging cross-attention maps, it refines these segments to find the regions to edit. Finally, we propose a novel cross-attention regularization technique that penalizes unrelated cross-attention scores in the RoI during the denoising steps, ensuring localized edits. Our approach, without re-training, fine-tuning, or additional user inputs, consistently improves the performance of existing methods on various editing benchmarks. The project page can be found at https://enisimsar.github.io/LIME/.
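
The regularization step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy scores, and the choice to subtract a constant `penalty` from the pre-softmax scores are assumptions made for illustration. The only thing taken from the text is the idea that scores of unrelated tokens (for example, the start-of-text token and stop words) are penalized inside the RoI.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def regularize_attention(scores, roi_mask, unrelated_tokens, penalty=10.0):
    """Return per-pixel attention probabilities over tokens.

    scores: one list of pre-softmax token scores per pixel.
    roi_mask: one bool per pixel, True if the pixel lies inside the RoI.
    unrelated_tokens: indices of tokens to suppress inside the RoI.
    penalty: hypothetical constant subtracted from unrelated scores.
    """
    probs = []
    for pixel_scores, in_roi in zip(scores, roi_mask):
        if in_roi:
            # Penalize unrelated tokens only for pixels inside the RoI.
            pixel_scores = [
                s - penalty if t in unrelated_tokens else s
                for t, s in enumerate(pixel_scores)
            ]
        probs.append(softmax(pixel_scores))
    return probs


# Toy prompt: ["<SoT>", "make", "her", "outfit", "black"]
unrelated = {0, 2}  # "<SoT>" and the stop word "her"
scores = [
    [4.0, 1.0, 2.0, 1.5, 1.0],  # pixel inside the RoI
    [4.0, 1.0, 2.0, 1.5, 1.0],  # pixel outside the RoI
]
probs = regularize_attention(scores, [True, False], unrelated)
mass_inside = sum(probs[0][t] for t in unrelated)
mass_outside = sum(probs[1][t] for t in unrelated)
```

In this toy case the unrelated tokens keep most of the attention mass outside the RoI, while inside the RoI almost all of the mass moves to the edit-related tokens. Pixels outside the RoI are left exactly as a plain softmax would produce them.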

Paper Structure (38 sections, 4 equations, 20 figures, 8 tables)


Figures (20)

  • Figure 1: LIME: Localized IMage Editing. LIME edits an image based on an edit instruction without needing customized datasets, fine-tuning, or explicit information about the object of interest. Adding LIME improves InstructPix2Pix (IP2P) (Brooks et al., 2022) and its version fine-tuned on the human-annotated MagicBrush (MB) dataset (Zhang et al., 2023), enabling localized edits that leave the rest of the image untouched.
  • Figure 2: Segmentation and RoI finding. Resolution Xs show segmentation maps from different resolutions, while Ours shows the segmentation map from our multi-resolution fusion method explained above. For the cross-attention map, yellow indicates high probability, and blue dots mark the N points with the highest probability. The last image shows the RoI extracted using the blue dots and Ours.
  • Figure 3: Attention Regularization. Our method selectively regularizes unrelated tokens (SoT and stop words: her) within the RoI, ensuring precise, context-aware edits without the need for extra model training or user inputs. After attention regularization, the related tokens attend to the RoI, as illustrated in the second row.
  • Figure 4: Qualitative Examples. We test our method on different tasks: (a) editing a large segment, (b) altering texture, (c) editing multiple segments, (d) adding, (e) replacing, and (f) removing objects. The integration of LIME enhances the performance of baselines, enabling localized edits while preserving the remaining image areas.
  • Figure 5: More Qualitative Examples. We test our method on different tasks: (a) modifying multiple objects, (b) changing color, (c) adding, (d) removing objects, and (e) changing texture. The integration of LIME enhances the performance of all models, enabling localized edits while maintaining the integrity of the remaining image areas.
  • ...and 15 more figures
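
The RoI extraction summarized in the Figure 2 caption (top-N cross-attention points combined with a segmentation map) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the function name `find_roi`, the toy arrays, and the rule "keep every segment that contains at least one top-N point" are hypothetical choices for this example.

```python
def find_roi(segments, attention, n_points=3):
    """Select segments hit by the n_points highest-attention pixels.

    segments: 2D list of integer segment ids (the segmentation map).
    attention: 2D list of cross-attention probabilities, same shape.
    Returns (mask, top_points), where mask is a 2D list of bools.
    """
    height, width = len(segments), len(segments[0])
    coords = [(r, c) for r in range(height) for c in range(width)]
    # The "blue dots": pixels with the highest attention probability.
    top_points = sorted(
        coords, key=lambda rc: attention[rc[0]][rc[1]], reverse=True
    )[:n_points]
    # Assumed rule: a segment is part of the RoI if any top point is in it.
    selected = {segments[r][c] for r, c in top_points}
    mask = [
        [segments[r][c] in selected for c in range(width)]
        for r in range(height)
    ]
    return mask, top_points


segments = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 2, 2],
]
attention = [
    [0.01, 0.02, 0.30, 0.25],
    [0.01, 0.03, 0.20, 0.10],
    [0.02, 0.02, 0.02, 0.02],
]
roi_mask, top_points = find_roi(segments, attention, n_points=3)
```

Here all three top points fall into segment 1, so the RoI mask covers exactly that segment. Selecting whole segments rather than individual high-attention pixels is what lets a coarse attention map yield a mask with clean boundaries.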