LoMOE: Localized Multi-Object Editing via Multi-Diffusion

Goirik Chakrabarty, Aditya Chandrasekar, Ramya Hebbalaguppe, Prathosh AP

TL;DR

LoMOE presents a zero-shot framework for localized multi-object editing in diffusion models by integrating inversion, a unified multidiffusion process, and attribute/background preservation losses. By operating on object masks with per-region prompts, it edits multiple targets in a single pass while preserving background structure and overall image coherence. The approach yields superior neural realism and competitive fidelity metrics, while offering substantial speed advantages over iterative multi-object editing methods. A new LoMOE-Bench dataset accompanies the method to benchmark multi-object editing, enhancing reproducibility and evaluation in this niche of image editing.

Abstract

Recent developments in the field of diffusion models have demonstrated an exceptional capacity to generate high-quality prompt-conditioned image edits. Nevertheless, previous approaches have primarily relied on textual prompts for image editing, which tend to be less effective when making precise edits to specific objects or fine-grained regions within a scene containing single/multiple objects. We introduce a novel framework for zero-shot localized multi-object editing through a multi-diffusion process to overcome this challenge. This framework empowers users to perform various operations on objects within an image, such as adding, replacing, or editing $\textbf{many}$ objects in a complex scene $\textbf{in one pass}$. Our approach leverages foreground masks and corresponding simple text prompts that exert localized influences on the target regions resulting in high-fidelity image editing. A combination of cross-attention and background preservation losses within the latent space ensures that the characteristics of the object being edited are preserved while simultaneously achieving a high-quality, seamless reconstruction of the background with fewer artifacts compared to the current methods. We also curate and release a dataset dedicated to multi-object editing, named $\texttt{LoMOE}$-Bench. Our experiments against existing state-of-the-art methods demonstrate the improved effectiveness of our approach in terms of both image editing quality and inference speed.
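
The localized influence of masks and per-region prompts described in the abstract can be illustrated with a small sketch. It assumes the standard MultiDiffusion-style fusion, in which each region is denoised under its own prompt and the results are merged by a per-pixel, mask-weighted average. The function name `fuse_region_latents`, the `(C, H, W)` latent layout, and the toy values are hypothetical and are not taken from the LoMOE code; the denoising step itself is omitted.

```python
import numpy as np

def fuse_region_latents(region_latents, masks, background_latent):
    """Blend per-region latents with a per-pixel, mask-weighted average.

    region_latents: list of (C, H, W) arrays, one per prompt/region (assumed layout).
    masks: list of (H, W) arrays with values in [0, 1], one per region.
    background_latent: (C, H, W) array used wherever no mask is active.
    """
    weighted_sum = np.zeros_like(background_latent, dtype=np.float64)
    weight_total = np.zeros(background_latent.shape[-2:], dtype=np.float64)
    for latent, mask in zip(region_latents, masks):
        mask = np.asarray(mask, dtype=np.float64)
        weighted_sum += latent * mask   # (H, W) mask broadcasts over channels
        weight_total += mask
    covered = weight_total > 0          # pixels touched by at least one mask
    fused = np.array(background_latent, dtype=np.float64)
    fused[:, covered] = weighted_sum[:, covered] / weight_total[covered]
    return fused

# Toy example: two regions that overlap in the middle column of the top row.
C, H, W = 4, 2, 3
background = np.zeros((C, H, W))
latent_a = np.full((C, H, W), 1.0)
latent_b = np.full((C, H, W), 3.0)
mask_a = np.array([[1, 1, 0], [0, 0, 0]])
mask_b = np.array([[0, 1, 1], [0, 0, 0]])
fused = fuse_region_latents([latent_a, latent_b], [mask_a, mask_b], background)
```

In this toy case the overlapping pixel receives the average of the two region latents, and the bottom row, which no mask covers, keeps the background latent unchanged.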

Paper Structure (38 sections, 11 equations, 11 figures, 6 tables)

Figures (11)

  • Figure 1: Representative results of LoMOE on diverse images: Our algorithm can handle both single and multi-object edits in one go. The first image in each example depicts the original image with the input mask (specifying the edit locations). Below each image is the original text used for its generation and the input text prompt (colored font) describing the edits. The second image depicts the edited image via our method. It is seen that our method can handle intricate localized object details such as heart color, earrings, window-view, multiple-cloud coloring, animal types in a painting, and tree-animal type.
  • Figure 2: Overview of the proposed LoMOE framework: ➀ Inversion, to obtain $x_{inv}$ and $c_0$ corresponding to the input image $\mathbf{x}_0$. ➁ MultiDiffusion process to restrict the edits to masked regions $M_1, M_2$ guided by $c_1, c_2$. ➂ Preservation of Attributes, via $\mathcal{L}_{xa}$ and $\mathcal{L}_b$ using reference cross-attention maps and background latents using a reconstruction process.
  • Figure 3: Comparison among contemporary methods for Single Object Edits: We observe that SDEdit and InstructP2P tend to modify the whole image. GLIDE often inpaints and removes the subject of the edit in cases where it fails to generate the edit. DiffEdit produces the same output as SDEdit while preserving the unmasked regions of the input image. BLD doesn't preserve the structure of the input and makes unintended attribute edits to the masked subject. Finally, we observe that our proposed LoMOE makes the intended edit, preserves the unmasked region, and avoids unintended attribute edits.
  • Figure 4: Comparison with contemporary methods for Multi-Object Edits: While the baselines are either unable to make the edit, accumulate artifacts, edit the unmasked region, or make unintended attribute edits, LoMOE is able to faithfully edit in accordance with the prompts.
  • Figure 5: Additional Comparison among Contemporary Methods for Single Object Edits: We present a qualitative comparison of LoMOE against other baseline methods on additional single-object edits. The observations are similar to those in Fig. 3 in the main paper, where our proposed method LoMOE makes the intended edit, preserves the unmasked region, and avoids unintended attribute edits.
  • ...and 6 more figures
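
The background-preservation term $\mathcal{L}_b$ named in the Figure 2 caption can be illustrated with a rough sketch. The exact form of the loss is not given in this summary, so the code below assumes a mean squared error between the edited latent and the latent from the reconstruction process, restricted to pixels outside every edit mask. The function name `background_preservation_loss`, the `(C, H, W)` latent layout, and the normalization are hypothetical.

```python
import numpy as np

def background_preservation_loss(edit_latent, reference_latent, masks):
    """Masked MSE over the background (assumed form, not the paper's definition).

    edit_latent, reference_latent: (C, H, W) arrays (assumed layout).
    masks: list of (H, W) arrays in [0, 1] marking the regions being edited.
    """
    # Union of all edit masks, clipped so overlapping masks count once.
    foreground = np.clip(np.sum(np.asarray(masks, dtype=np.float64), axis=0), 0.0, 1.0)
    background = 1.0 - foreground
    diff = (edit_latent - reference_latent) * background  # zero inside edit regions
    count = background.sum() * edit_latent.shape[0]       # background elements over all channels
    return float((diff ** 2).sum() / max(count, 1.0))

# Toy example: a large change inside the mask and a small drift in the background.
reference = np.zeros((2, 2, 2))
edited = reference.copy()
edited[:, 0, 0] = 5.0   # inside the edit mask, so it is not penalized
edited[:, 1, 1] = 1.0   # background pixel that drifted from the reference
mask = np.array([[1, 0], [0, 0]])
loss = background_preservation_loss(edited, reference, [mask])
```

Only the drift at the unmasked pixel contributes to `loss`; the change inside the mask is ignored, which is the behavior a background-preservation term is meant to encourage.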