ReMOVE: A Reference-free Metric for Object Erasure

Aditya Chandrasekar, Goirik Chakrabarty, Jai Bardhan, Ramya Hebbalaguppe, Prathosh AP

TL;DR

ReMOVE is a reference-free metric for evaluating object erasure quality in diffusion-based image editing. It uses Vision Transformer patch features to compare the mean representations of the masked and unmasked regions of the edited image. The metric is defined as $\mathrm{ReMOVE}=\mathcal{S}(\bar{\mathbf{z}}_m,\bar{\mathbf{z}}_u)$, where $\mathcal{S}$ is cosine similarity and $\bar{\mathbf{z}}_m$ and $\bar{\mathbf{z}}_u$ are the mean patch features of the masked and unmasked regions; a cropping step keeps patch counts comparable across masks of varying sizes. Empirical results on synthetic toy data and real-world DEFACTO data show that ReMOVE correlates with perceptual quality and aligns with human judgments, outperforming CLIPScore in many scenarios. The work demonstrates that deep-feature, reference-free evaluation can reliably assess inpainting outcomes when ground-truth references are unavailable, aiding the development and deployment of diffusion-based image editing tools.
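
Written out, the definition above can be read roughly as follows. The notation $\mathbf{z}_i$ for the encoder feature of patch $i$, and the index sets $M$ and $U$ for patches inside and outside the mask, are introduced here for illustration and are not taken from the paper:

$$
\bar{\mathbf{z}}_m=\frac{1}{|M|}\sum_{i\in M}\mathbf{z}_i,\qquad
\bar{\mathbf{z}}_u=\frac{1}{|U|}\sum_{i\in U}\mathbf{z}_i,\qquad
\mathrm{ReMOVE}=\mathcal{S}(\bar{\mathbf{z}}_m,\bar{\mathbf{z}}_u)=\frac{\bar{\mathbf{z}}_m^{\top}\bar{\mathbf{z}}_u}{\lVert\bar{\mathbf{z}}_m\rVert\,\lVert\bar{\mathbf{z}}_u\rVert}
$$

Under this reading, a value close to 1 would indicate that the features of the inpainted region resemble those of the background, which is what a successful erasure should look like.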

Abstract

We introduce $\texttt{ReMOVE}$, a novel reference-free metric for assessing object erasure efficacy in diffusion-based image editing models post-generation. Unlike existing measures such as LPIPS and CLIPScore, $\texttt{ReMOVE}$ addresses the challenge of evaluating inpainting without a reference image, a situation common in practical scenarios. It effectively distinguishes between object removal and object replacement, a key issue in diffusion models due to the stochastic nature of image generation. Traditional metrics fail to align with the intuitive definition of inpainting, which aims for (1) seamless object removal within the masked region while (2) preserving background continuity. $\texttt{ReMOVE}$ not only correlates with state-of-the-art metrics and aligns with human perception but also captures the nuanced aspects of the inpainting process, providing a finer-grained evaluation of the generated outputs.

Paper Structure (13 sections, 8 figures, 1 table)

This paper contains 13 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: Motivation for ReMOVE: Comparison of ReMOVE with CLIPScore, illustrating that CLIPScore fails to distinguish (denoted by ✓ and ✗) between two inpainting methods: one that erases the object and one that replaces it with another object.
  • Figure 2: Randomness in Object Inpainting using SD-Inpaint: Samples of diffusion-based image inpainting using SD-Inpaint [stablediff], generated across varying seeds. The object intended for inpainting is often substituted with a different object rather than being replaced with the background. In some cases (column 6), the model replaces the object with background pixels as desired.
  • Figure 3: Randomness in Object Inpainting with Other Methods: Samples of diffusion-based image inpainting using multiple methods. The object intended for inpainting is often substituted with a different object rather than being replaced with the background.
  • Figure 4: Schematic Diagram of ReMOVE: The inpainter takes the original image and (optionally) object mask to produce an edited image with the object deleted. Our metric requires only the edited image and object mask. After preprocessing using a bounding box crop (only in the crop-variant) and resizing, the image is tokenized into patches, and the encoder $\mathcal{E}$ obtains features for each patch. Simultaneously, the mask is resized and used to split patch embeddings into object and background embeddings. The mean feature embeddings are compared using a similarity measure to yield ReMOVE.
  • Figure 5: Toy Dataset made using background images, randomly selected masks and SD-Inpaint.
  • ...and 3 more figures
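
The pipeline described in the Figure 4 caption can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the encoder $\mathcal{E}$ is left out (the function takes precomputed patch features), the bounding-box crop of the crop-variant is omitted, and the names `mask_to_patches` and `remove_score` as well as the 0.5 coverage threshold for deciding when a patch belongs to the object are assumptions made here.

```python
import numpy as np


def mask_to_patches(mask: np.ndarray, patch_size: int, threshold: float = 0.5) -> np.ndarray:
    """Downsample a binary pixel mask (H, W) to a flat boolean patch mask.

    Assumption: a patch counts as "object" when more than `threshold` of its
    pixels are masked. The caption only says that the mask is resized.
    """
    h, w = mask.shape
    if h % patch_size or w % patch_size:
        raise ValueError("mask height and width must be multiples of patch_size")
    gh, gw = h // patch_size, w // patch_size
    # Axes after reshape: (patch row, pixel row in patch, patch col, pixel col in patch).
    blocks = mask.astype(float).reshape(gh, patch_size, gw, patch_size)
    coverage = blocks.mean(axis=(1, 3))  # fraction of masked pixels per patch
    return (coverage > threshold).reshape(-1)  # row-major patch order


def remove_score(patch_features: np.ndarray, patch_mask: np.ndarray) -> float:
    """Cosine similarity between the mean object and mean background features.

    patch_features: (N, D) array, one feature vector per patch (from any encoder).
    patch_mask:     (N,) boolean array, True for patches inside the object mask.
    """
    if patch_mask.all() or not patch_mask.any():
        raise ValueError("need at least one masked and one unmasked patch")
    z_m = patch_features[patch_mask].mean(axis=0)   # mean object embedding
    z_u = patch_features[~patch_mask].mean(axis=0)  # mean background embedding
    denom = np.linalg.norm(z_m) * np.linalg.norm(z_u)
    return float(z_m @ z_u / denom)
```

As a sanity check, if every patch carries the same feature vector, the two means coincide and the score is at its maximum; if the masked patches carry a vector orthogonal to the background's, the score drops to zero, which matches the intuition that a replaced object should score lower than a clean erasure.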