ReMOVE: A Reference-free Metric for Object Erasure
Aditya Chandrasekar, Goirik Chakrabarty, Jai Bardhan, Ramya Hebbalaguppe, Prathosh AP
TL;DR
ReMOVE introduces a reference-free metric for evaluating object erasure quality in diffusion-based image editing by leveraging Vision Transformer patch features to compare mean representations of masked and unmasked regions. The metric defines $\mathrm{ReMOVE}=\mathcal{S}(\bar{\mathbf{z}}_m,\bar{\mathbf{z}}_u)$ using cosine similarity, with a cropping step to ensure fair patch-count comparisons across masks of varying sizes. Empirical results on synthetic toy data and real-world DEFACTO data show that ReMOVE correlates with perceptual quality and aligns with human judgments, outperforming CLIPScore in many scenarios. The work demonstrates that deep-feature, reference-free evaluation can reliably assess inpainting outcomes when ground-truth references are unavailable, aiding the development and deployment of diffusion-based image editing tools.
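The definition above can be illustrated with a minimal NumPy sketch. It assumes the ViT patch features have already been extracted into an array of shape (num_patches, dim) and that a boolean per-patch mask marks the erased region; the function name `remove_score`, its parameters, and the `eps` stabilizer are hypothetical choices for illustration, and the cropping step is omitted.

```python
import numpy as np


def remove_score(patch_features, patch_mask, eps=1e-8):
    """Cosine similarity between mean masked and mean unmasked patch features.

    patch_features: array of shape (num_patches, dim), assumed to come from a
        ViT encoder (feature extraction is not shown here).
    patch_mask: boolean array of shape (num_patches,), True for patches inside
        the erased (masked) region.
    eps: small constant guarding against division by zero (an assumption of
        this sketch, not part of the stated definition).
    """
    z = np.asarray(patch_features, dtype=np.float64)
    m = np.asarray(patch_mask, dtype=bool)
    if z.ndim != 2 or m.shape != (z.shape[0],):
        raise ValueError("expected features (N, D) and mask (N,)")
    if m.all() or not m.any():
        raise ValueError("mask must contain both masked and unmasked patches")

    z_m = z[m].mean(axis=0)   # mean feature of the masked patches
    z_u = z[~m].mean(axis=0)  # mean feature of the unmasked patches

    # Cosine similarity S(z_m, z_u)
    denom = np.linalg.norm(z_m) * np.linalg.norm(z_u) + eps
    return float(z_m @ z_u / denom)
```

A score near 1 means the mean feature of the erased region points in nearly the same direction as that of the background, which is the behaviour the metric rewards; a replaced object would be expected to pull the masked mean away from the background mean and lower the score.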
Abstract
We introduce $\texttt{ReMOVE}$, a novel reference-free metric for assessing object erasure efficacy in diffusion-based image editing models post-generation. Unlike existing measures such as LPIPS and CLIPScore, $\texttt{ReMOVE}$ addresses the challenge of evaluating inpainting without a reference image, a situation common in practical scenarios. It effectively distinguishes between object removal and object replacement, a key issue in diffusion models due to the stochastic nature of image generation. Traditional metrics fail to align with the intuitive definition of inpainting, which aims for (1) seamless object removal within masked regions while (2) preserving background continuity. $\texttt{ReMOVE}$ not only correlates with state-of-the-art metrics and aligns with human perception but also captures the nuanced aspects of the inpainting process, providing a finer-grained evaluation of the generated outputs.
