FlexEdit: Flexible and Controllable Diffusion-based Object-centric Image Editing

Trong-Tung Nguyen; Duc-Anh Nguyen; Anh Tran; Cuong Pham

TL;DR

FlexEdit tackles the problem of fragile object-centric edits in diffusion-based image editing by introducing an iterative latent-editing framework. At each denoising step, a FlexEdit block performs latent optimization guided by explicit object constraints and latent blending using an adaptive binary mask derived from attention maps, balancing editing semantics with background fidelity. The approach is validated on real and synthetic data with new object-centric benchmarks and metrics, demonstrating competitive trade-offs against state-of-the-art methods and strong user preferences. Limitations include potential failures from imperfect attention-derived masks and higher computational cost due to multi-step optimization, motivating future work on faster or single-step editing while maintaining fidelity.

Abstract

Our work addresses limitations seen in previous approaches for object-centric editing problems, such as unrealistic results due to shape discrepancies and limited control in object replacement or insertion. To this end, we introduce FlexEdit, a flexible and controllable editing framework for objects where we iteratively adjust latents at each denoising step using our FlexEdit block. Initially, we optimize latents at test time to align with specified object constraints. Then, our framework employs an adaptive mask, automatically extracted during denoising, to protect the background while seamlessly blending new content into the target image. We demonstrate the versatility of FlexEdit in various object editing tasks and curate an evaluation test suite with samples from both real and synthetic images, along with novel evaluation metrics designed for object-centric editing. We conduct extensive experiments on different editing scenarios, demonstrating the superiority of our editing framework over recent advanced text-guided image editing methods. Our project page is published at https://flex-edit.github.io/.
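
The blending step described above can be illustrated with a minimal sketch. It assumes the adaptive mask is obtained by min-max normalizing an attention map and thresholding it, and that blending is a per-pixel convex combination of edited and source latents. The function names `attention_to_mask` and `blend_latents`, the threshold of 0.5, and the toy shapes are hypothetical choices for illustration, not details taken from the paper.

```python
import numpy as np


def attention_to_mask(attn, threshold=0.5):
    """Min-max normalize an attention map, then threshold it into a binary mask.

    The threshold value is an assumption made for this sketch.
    """
    lo, hi = attn.min(), attn.max()
    norm = (attn - lo) / (hi - lo + 1e-8)
    return (norm > threshold).astype(attn.dtype)


def blend_latents(z_edit, z_src, mask):
    """Keep edited latents inside the mask and source latents outside it."""
    return mask * z_edit + (1.0 - mask) * z_src


# Toy example: a 2x2 attention map and 4-channel latents (hypothetical shapes).
attn = np.array([[0.1, 0.2], [0.8, 0.9]])
z_src = np.zeros((4, 2, 2))   # stands in for the source (inverted) latents
z_edit = np.ones((4, 2, 2))   # stands in for the optimized, edited latents

mask = attention_to_mask(attn)               # shape (2, 2), broadcast over channels
z_blend = blend_latents(z_edit, z_src, mask)
```

In this toy case the bottom row of the attention map exceeds the threshold, so the blended latents take the edited values there and keep the source values in the top row, which is how a mask of this kind can protect the background.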

Paper Structure (32 sections, 10 equations, 16 figures, 5 tables)

Introduction
Related Works
Text-guided Image Editing with Diffusion Model
Controllable Image Synthesis
Background
Stable Diffusion Model
Cross-Attention and Self-Attention Layers in SD
Approach
Overview of Editing Framework
Dynamic Object Binary Mask from Attention Map
Latent Optimization with Object Constraints
Latent Blending with Adaptive Binary Mask
Iterative Latent Manipulation with FlexEdit
Experiments
Experimental Setup
...and 17 more sections

Figures (16)

Figure 1: Our framework achieves robust and flexible control over several text-guided object-centric editing scenarios, including a) replacing objects with controllable size and position, b) adding new objects naturally without additional mask input, and c) removing objects without compromising the quality of the original image.
Figure 2: An editing scenario in which the edited object (monkey) and the source object (bear) differ in shape. FlexEdit achieves flexible shape transformation while preserving high fidelity to the source image's background.
Figure 3: Overview of the FlexEdit framework. Given an input image $I$, we first map it to intermediate source latents through an inversion process. The denoising process then starts from $z_T^*$, cloned from $z_T$ after inversion, and progresses toward $z_0^*$, which is decoded to obtain the edited image $I^*$. At each denoising step, our FlexEdit block manipulates the noisy latent code through two main submodules: latent optimization (shown in blue) and latent blending (shown in orange). Together they achieve the editing semantics while maintaining high fidelity to the source image. If the iterative process (shown in green) is not executed, FlexEdit returns $z_t^*$.
Figure 4: Visualization of different versions of cross-attention maps and dynamic binary masks for the edited object (here, a truck) during the denoising diffusion process.
Figure 5: Performance comparison of FlexEdit against existing editing techniques on the SynO, PieBenchO, and MagicO datasets. The method on the bottom right of each subplot provides the best background preservation and editing quality trade-off.
...and 11 more figures
