Editable Image Elements for Controllable Synthesis

Jiteng Mu; Michaël Gharbi; Richard Zhang; Eli Shechtman; Nuno Vasconcelos; Xiaolong Wang; Taesung Park

TL;DR

This work introduces editable image elements, a patch-based, spatially controllable representation for realistically editing user-provided images with diffusion models. An image is partitioned into semantically meaningful patches, and the appearance and location of each patch are encoded separately. A content encoder is paired with a diffusion decoder conditioned on both text and image elements, and dropout-based training improves robustness to edits. The approach supports object resizing, rearrangement, removal, inpainting, and image composition; in the reported experiments and user studies it outperforms several baselines in both reconstruction fidelity and edit quality. The framework offers a fast, interactive route to spatial image editing with diffusion models, and the paper also discusses current limitations and directions toward richer appearance control and higher resolution.

Abstract

Diffusion models have made significant advances in text-guided synthesis tasks. However, editing user-provided images remains challenging, as the high-dimensional noise input space of diffusion models is not naturally suited for image inversion or spatial editing. In this work, we propose an image representation that promotes spatial editing of input images using a diffusion model. Concretely, we learn to encode an input into "image elements" that can faithfully reconstruct an input image. These elements can be intuitively edited by a user, and are decoded by a diffusion model into realistic images. We show the effectiveness of our representation on various image editing tasks, such as object resizing, rearrangement, dragging, de-occlusion, removal, variation, and image composition. Project page: https://jitengmu.github.io/Editable_Image_Elements/
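
To make the representation concrete, here is a minimal sketch of how an image element and a few spatial edits could be modeled. It is not the authors' implementation. The `ImageElement` class, its field names, and the `move`, `resize`, and `remove` helpers are hypothetical, and the sketch assumes each element carries an appearance embedding together with a centroid and a size, as the pipeline overview in Figure 2 describes.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class ImageElement:
    """Hypothetical container: an appearance code plus location and size."""
    content: Tuple[float, ...]      # appearance embedding (assumed to come from the content encoder)
    centroid: Tuple[float, float]   # (x, y), assumed normalized to [0, 1]
    size: Tuple[float, float]       # (width, height), assumed normalized to [0, 1]


def move(el: ImageElement, dx: float, dy: float) -> ImageElement:
    """Shift the centroid; the appearance code is left untouched."""
    x, y = el.centroid
    return replace(el, centroid=(x + dx, y + dy))


def resize(el: ImageElement, scale: float) -> ImageElement:
    """Scale width and height by the same factor."""
    w, h = el.size
    return replace(el, size=(w * scale, h * scale))


def remove(elements: List[ImageElement], index: int) -> List[ImageElement]:
    """Return a new list without the element at `index`."""
    return [el for i, el in enumerate(elements) if i != index]


# Two toy elements with made-up values.
elements = [
    ImageElement(content=(0.1, 0.2), centroid=(0.25, 0.5), size=(0.1, 0.1)),
    ImageElement(content=(0.3, 0.4), centroid=(0.75, 0.5), size=(0.2, 0.2)),
]

# Drag the first element to the right, shrink it, then delete the second one.
edited = [resize(move(elements[0], 0.25, 0.0), 0.5), elements[1]]
edited = remove(edited, 1)
```

In this sketch the edits change only location and size, so the appearance code of an edited element stays the same. Producing a realistic image from the edited set would be the job of the diffusion decoder, which is not modeled here.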

Paper Structure (21 sections, 5 equations, 24 figures, 2 tables, 2 algorithms)

This paper contains 21 sections, 5 equations, 24 figures, 2 tables, 2 algorithms.

Introduction
Related Works
Method
Image Elements
Content Encoder
Diffusion Decoder
Experiments
Dataset and Training Details
Spatial Editing
Object Variations, Removal, and Composition
Ablation Studies
Discussion and Limitations
Additional Editing Comparison
Reconstruction Comparison
Implementation Details
...and 6 more sections

Figures (24)

Figure 1: We propose editable image elements, a flexible representation that faithfully reconstructs an input image, while enabling various spatial editing operations. (top) The user simply identifies interesting image elements (red dots) and edits their locations and sizes (green dots). Our model automatically performs object-shrinking and de-occlusion in the output image to respect the edited elements. For example, the missing corners of the car are inpainted. (bottom) More editing outputs are shown: object replacement, object removal, re-arrangement, and image composition.
Figure 2: Overview of our image editing pipeline. (top) To encode the image, we extract features from the Segment Anything Model (Kirillov et al., 2023) with equally spaced query points and perform simple clustering to obtain a grouping of object parts with comparable sizes, resembling superpixels (Achanta et al., 2012). Each element is individually encoded with our convolutional encoder and is associated with its centroid and size parameters to form image elements. (bottom) The user can directly modify the image elements, such as moving, resizing, or removing. We pass the modified image elements to our diffusion-based decoder along with a text description of the overall scene to synthesize a realistic image that respects the modified elements.
Figure 3: Details of our diffusion-based decoder. First, we obtain positional embeddings of the location and size of the image elements, and concatenate them with the content embeddings to produce attention tokens to be passed to the diffusion model. Our diffusion model is a finetuned text-to-image Stable Diffusion UNet, with extra cross-attention layers on the image elements. The features from the text cross-attention layer and image element cross-attention layer are added equally to the self-attention features. Both conditionings are used to perform classifier-free guidance with equal weights.
Figure 4: The user can directly edit the image elements with simple selection, dragging, resizing, and deletion operations. The selected and edited elements are highlighted with red and green dots at the centroid of each element.
Figure 5: The user can directly edit the image elements with simple selection, dragging, resizing, and deletion operations. The selected and edited elements are highlighted with red and green dots at the centroid of each element.
...and 19 more figures
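
The Figure 3 caption describes two steps that a short sketch can illustrate: building one attention token per image element, and adding the two cross-attention outputs to the self-attention features. The code below is a rough sketch under stated assumptions, not the paper's decoder. The function names are hypothetical, the sinusoidal form of the positional embedding is an assumption (the caption says only "positional embeddings"), and plain Python lists stand in for feature tensors.

```python
import math
from typing import List, Sequence


def sinusoidal_embedding(value: float, dim: int) -> List[float]:
    """Map a scalar to `dim` sin/cos features (assumed embedding form)."""
    half = dim // 2
    freqs = [2.0 ** k * math.pi for k in range(half)]
    return [math.sin(f * value) for f in freqs] + [math.cos(f * value) for f in freqs]


def element_token(content: Sequence[float],
                  centroid: Sequence[float],
                  size: Sequence[float],
                  dim: int = 4) -> List[float]:
    """Concatenate the content embedding with embeddings of x, y, width, height."""
    token = list(content)
    for v in (*centroid, *size):
        token += sinusoidal_embedding(v, dim)
    return token


def fuse(self_feat: Sequence[float],
         text_feat: Sequence[float],
         element_feat: Sequence[float]) -> List[float]:
    """Add both cross-attention outputs to the self-attention features with equal weight."""
    return [s + t + e for s, t, e in zip(self_feat, text_feat, element_feat)]


# Toy values: a 3-dimensional content code and made-up feature vectors.
token = element_token(content=(0.1, 0.2, 0.3), centroid=(0.5, 0.5), size=(0.2, 0.1))
fused = fuse([1.0, 2.0], [0.5, 0.5], [0.25, 0.25])
```

Here each token has the content dimensions followed by four positional blocks of `dim` values, one each for x, y, width, and height. The classifier-free guidance step mentioned in the caption is left out because the caption does not specify its exact form.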
