Click2Mask: Local Editing with Dynamic Mask Generation

Omer Regev; Omri Avrahami; Dani Lischinski

TL;DR

Click2Mask performs local image editing from minimal user input: a single clicked point plus a text description of the content to add. A mask evolves dynamically around the click, guided by a semantic loss. The approach uses Blended Latent Diffusion (BLD) as the editing backbone and Alpha-CLIP to steer the mask evolution; the final edited image is produced by a last BLD pass with the evolved mask. Key contributions include removing the need for precise masks or detailed location prompts, enabling free-form object addition that is not constrained by existing segments, and providing a mask-evolution mechanism that can be integrated into other editing methods. In both human judgments and automatic metrics, Click2Mask is competitive with or superior to state-of-the-art baselines, and ablation studies support the design choices. The method offers a practical, user-friendly path to localized image manipulation and can be embedded into broader editing pipelines.

Abstract

Recent advancements in generative models have revolutionized image generation and editing, making these tasks accessible to non-experts. This paper focuses on local image editing, particularly the task of adding new content to a loosely specified area. Existing methods often require a precise mask or a detailed description of the location, which can be cumbersome and prone to errors. We propose Click2Mask, a novel approach that simplifies the local editing process by requiring only a single point of reference (in addition to the content description). A mask is dynamically grown around this point during a Blended Latent Diffusion (BLD) process, guided by a masked CLIP-based semantic loss. Click2Mask surpasses the limitations of segmentation-based and fine-tuning dependent methods, offering a more user-friendly and contextually accurate solution. Our experiments demonstrate that Click2Mask not only minimizes user effort but also enables competitive or superior local image manipulations compared to SoTA methods, according to both human judgement and automatic metrics. Key contributions include the simplification of user input, the ability to freely add objects unconstrained by existing segments, and the integration potential of our dynamic mask approach within other editing methods.
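To make the role of the mask concrete, the following is a minimal sketch of the kind of per-step latent blending that BLD-style editing relies on: inside the mask the prompt-guided foreground latent is kept, and outside it the background latent derived from the input image is kept. Everything here is illustrative and not the authors' implementation. The names `disk_mask` and `blend` are hypothetical, the latents are toy single-channel grids, and the fixed disk around the click merely stands in for the dynamically evolved mask; the Alpha-CLIP-guided evolution itself is not shown.

```python
from typing import List

Grid = List[List[float]]


def disk_mask(height: int, width: int, click_row: int, click_col: int,
              radius: float) -> Grid:
    """Binary mask that is 1.0 inside a disk centred on the clicked point.

    Assumption: a fixed disk is only a placeholder for the evolving mask.
    """
    return [
        [1.0 if (r - click_row) ** 2 + (c - click_col) ** 2 <= radius ** 2 else 0.0
         for c in range(width)]
        for r in range(height)
    ]


def blend(z_fg: Grid, z_bg: Grid, mask: Grid) -> Grid:
    """Per-element blend: keep z_fg where mask is 1, z_bg where mask is 0."""
    return [
        [m * fg + (1.0 - m) * bg for fg, bg, m in zip(fg_row, bg_row, m_row)]
        for fg_row, bg_row, m_row in zip(z_fg, z_bg, mask)
    ]


# Toy 5x5 single-channel "latents": foreground is all ones, background all zeros.
z_fg = [[1.0] * 5 for _ in range(5)]
z_bg = [[0.0] * 5 for _ in range(5)]

# The user clicks the centre cell; the placeholder mask covers the click
# and its four direct neighbours.
mask = disk_mask(5, 5, click_row=2, click_col=2, radius=1.0)
blended = blend(z_fg, z_bg, mask)
```

In this toy setup only the cells under the mask take the foreground value, which is the property that keeps the rest of the image untouched by the edit.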

Paper Structure (20 sections, 1 equation, 23 figures, 3 tables, 1 algorithm)

This paper contains 20 sections, 1 equation, 23 figures, 3 tables, 1 algorithm.

Introduction
Related Work
Blended Latent Diffusion
Method
Dynamic Mask Evolution
Results
Human Evaluation
Automatic Metrics
Edited Alpha-CLIP
Ablation Study
Model Limitations
Conclusion
Additional Experiments
Additional Results
Additional Ablation Study
...and 5 more sections

Figures (23)

Figure 1: Comparisons to SoTA models. A comparison of Emu Edit [sheynin2023emu], MagicBrush [Zhang2023MagicBrush], and DALL·E 3 [BetkerImprovingIG] with our model, Click2Mask. In each example, the top prompt was given to the other models, while Click2Mask received the simpler bottom prompt, in addition to the blue dot (mouse click) on the input. The other models change the entire image or its background, fail to edit, or produce unrealistic results.
Figure 2: Mask evolution. A visualization of the mask evolution throughout the diffusion process. The leftmost image is the input with the clicked point, and the rightmost image is the final Click2Mask output. Intermediate images are decoded latents $\tilde{z}_{\text{fg}}$ at several diffusion steps, where the purple outline depicts the contour of the current (upscaled) mask $M_t$. Percentages indicate the step out of 100 diffusion steps, with the last being the final evolved mask.
Figure 3: Examples of Click2Mask outputs. The leftmost column is the input image with clicked point. The other columns are Click2Mask outputs given the prompts below.
Figure 4: Comparisons with SoTA methods. Comparisons of Emu Edit [sheynin2023emu], MagicBrush [Zhang2023MagicBrush], and InstructPix2Pix [brooks2022instructpix2pix] with our model, Click2Mask. Upper prompts were given to the baselines, and lower ones to Click2Mask. The inputs contain the clicked point given to Click2Mask. As \ref{fig:compare_issues} shows, baselines often modify unrelated objects, make global changes, misplace elements, or replace rather than add objects. See the appendix for more comparisons.
Figure 5: Examples of generated masks. For each triplet, given an input image with clicked point (left) and a prompt (below), a purple overlay shows the generated mask (middle). The rightmost image is Click2Mask output.
...and 18 more figures
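
The Figure 2 caption describes drawing the contour of an upscaled mask over decoded images. As a rough illustration of that visualization step only, the sketch below upscales a latent-resolution binary mask by nearest-neighbour repetition and marks its boundary pixels. This is an assumption-laden toy, not the authors' code: the helper names `upscale_nearest` and `contour` are hypothetical, the upscaling factor of 3 is arbitrary, and the 4-neighbour boundary rule is one simple choice among several.

```python
from typing import List

Mask = List[List[int]]


def upscale_nearest(mask: Mask, factor: int) -> Mask:
    """Nearest-neighbour upscaling: each cell becomes a factor x factor block."""
    height, width = len(mask), len(mask[0])
    return [
        [mask[r // factor][c // factor] for c in range(width * factor)]
        for r in range(height * factor)
    ]


def contour(mask: Mask) -> Mask:
    """Mark mask pixels that have at least one 4-neighbour outside the mask.

    Pixels beyond the image border count as outside.
    """
    height, width = len(mask), len(mask[0])

    def inside(r: int, c: int) -> bool:
        return 0 <= r < height and 0 <= c < width and mask[r][c] == 1

    outline = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if mask[r][c] != 1:
                continue
            neighbours = ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1))
            if not all(inside(nr, nc) for nr, nc in neighbours):
                outline[r][c] = 1
    return outline


# Toy 4x4 latent-resolution mask with a 2x2 active region.
latent_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

pixel_mask = upscale_nearest(latent_mask, factor=3)  # 12x12, with a 6x6 active block
outline = contour(pixel_mask)                        # border ring of the 6x6 block
```

Overlaying `outline` in a highlight colour on the decoded image at a given step would give a picture in the spirit of the purple contours described in the caption.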
