CLIPAway: Harmonizing Focused Embeddings for Removing Objects via Diffusion Models

Yigit Ekin; Ahmet Burak Yildirim; Erdem Eren Caglar; Aykut Erdem; Erkut Erdem; Aysegul Dundar

TL;DR

Without explicit guidance, diffusion-based inpainting often hallucinates a new object in place of the one being removed. The authors present CLIPAway, which uses AlphaCLIP to obtain a foreground-focused embedding $\mathbf{e}_{\text{f}}$ and a background-focused embedding $\mathbf{e}_{\text{b}}$, aligns both with the IP-Adapter input space through an MLP, and then subtracts from the background embedding its projection onto the foreground embedding, $\mathbf{e}_{\text{final}} = \mathbf{e}_{\text{b}} - ( (\mathbf{e}_{\text{b}} \cdot \mathbf{e}_{\text{f}})/\|\mathbf{e}_{\text{f}}\| ) ( \mathbf{e}_{\text{f}} / \|\mathbf{e}_{\text{f}}\| )$, to suppress foreground content. The approach is plug-and-play, requires no specialized training dataset or manual annotation, and is compatible with multiple diffusion-based inpainting methods. It is evaluated on COCO 2017 with quantitative metrics and a user study, in which participants strongly preferred CLIPAway. The paper highlights the practical impact for image restoration and editing, and notes ethical considerations and two limitations: inference is slower than with GAN-based methods, and shadows are removed only if they are included in the mask.
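The subtraction in the TL;DR is an orthogonal projection: the component of $\mathbf{e}_{\text{b}}$ that lies along $\mathbf{e}_{\text{f}}$ is removed, so the result has zero dot product with $\mathbf{e}_{\text{f}}$. The following is a minimal sketch of that one step in plain Python, not the authors' implementation. The function name `remove_foreground`, the `eps` guard for a zero foreground vector, and the toy three-dimensional vectors are assumptions made for illustration; real embeddings would come from AlphaCLIP and the MLP.

```python
import math
from typing import List, Sequence


def remove_foreground(
    e_b: Sequence[float], e_f: Sequence[float], eps: float = 1e-12
) -> List[float]:
    """Return e_b minus its projection onto e_f (hypothetical helper).

    Implements e_final = e_b - ((e_b . e_f) / ||e_f||) * (e_f / ||e_f||).
    """
    if len(e_b) != len(e_f):
        raise ValueError("embeddings must have the same dimension")

    norm_f = math.sqrt(sum(x * x for x in e_f))
    if norm_f < eps:
        # Assumption: with no foreground direction, nothing is subtracted.
        return list(e_b)

    # Unit vector along the foreground embedding.
    unit_f = [x / norm_f for x in e_f]
    # Scalar length of e_b's component along the foreground direction.
    scale = sum(b * u for b, u in zip(e_b, unit_f))
    return [b - scale * u for b, u in zip(e_b, unit_f)]


# Toy example: the foreground direction is the first axis.
e_b = [3.0, 4.0, 0.0]
e_f = [2.0, 0.0, 0.0]
e_final = remove_foreground(e_b, e_f)
print(e_final)  # [0.0, 4.0, 0.0]
```

In the toy example the first coordinate, which is the only one shared with the foreground direction, is zeroed out while the rest of the background embedding is left unchanged.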

Abstract

Advanced image editing techniques, particularly inpainting, are essential for seamlessly removing unwanted elements while preserving visual integrity. Traditional GAN-based methods have achieved notable success, but recent advancements in diffusion models have produced superior results due to their training on large-scale datasets, enabling the generation of remarkably realistic inpainted images. Despite their strengths, diffusion models often struggle with object removal tasks without explicit guidance, leading to unintended hallucinations of the removed object. To address this issue, we introduce CLIPAway, a novel approach leveraging CLIP embeddings to focus on background regions while excluding foreground elements. CLIPAway enhances inpainting accuracy and quality by identifying embeddings that prioritize the background, thus achieving seamless object removal. Unlike other methods that rely on specialized training datasets or costly manual annotations, CLIPAway provides a flexible, plug-and-play solution compatible with various diffusion-based inpainting techniques.

Paper Structure (17 sections, 2 equations, 12 figures, 2 tables, 1 algorithm)

Introduction
Related Work
Method
Preliminaries
CLIPAway
Experiments
Baselines
Datasets
Metrics
Results
Conclusion and Broader Impacts
Supplementary Material
Training and Inference Algorithm
MLP model and training details
Conditional Image Generation Examples
...and 2 more sections

Figures (12)

Figure 1: Diffusion-based inpainting methods often struggle with object removal tasks. Instead of seamlessly filling the erased area with background elements, diffusion models may unintentionally replace the removed object with another or add irrelevant objects. This outcome diverges from the user's intention, which is typically to restore the area with the background alone, without introducing new elements. Our method, CLIPAway, aims at amending this deficiency by precisely focusing on maintaining the integrity of the background, ensuring that the space is filled as intended by the user.
Figure 2: Limitations of IP-Adapter (Ye et al., 2023) for inpainting. Directly using the IP-Adapter with the input image as the image prompt is ineffective for inpainting, as it predictably fills the erased area with the original object. Giving the text prompt "background" is also problematic, because the background can itself contain instances of the object that we want to remove, which results in a direct replacement of the foreground object. Using an erased image as the prompt, on the other hand, produces black artifacts.
Figure 3: The overall framework of CLIPAway. Input images, comprising both foreground and background elements, are embedded via AlphaCLIP. These embeddings are then processed by an MLP trained to adapt the features to the IP-Adapter input space. Vector arithmetic on the features then yields a background embedding without foreground influence. For clarity, SDInpaint is depicted as operating in image space; in practice it operates in latent space.
Figure 4: Starting with an input image and mask, we present our findings using both foreground- and background-focused embeddings. The images in the first row show the conditional image generation outcomes of the Stable Diffusion model without the inpainting task. These visuals offer insight into what each embedding focuses on. While both embeddings capture features from various parts of the image, the foreground embedding tends to emphasize the foreground, whereas the background embedding predominantly focuses on the background but still contains the foreground. Our approach successfully removes the foreground in the generated results, yielding pure background. This outcome is consistent with the image inpainting outputs shown in the second row.
Figure 5: Comparison of CLIPAway with state-of-the-art methods based on image quality and inpainting accuracy.
...and 7 more figures
