PixLens: A Novel Framework for Disentangled Evaluation in Diffusion-Based Image Editing with Object Detection + SAM

Stefan Stefanache; Lluís Pastor Pérez; Julen Costa Watanabe; Ernesto Sanchez Tejedor; Thomas Hofmann; Enis Simsar

TL;DR

PixLens is an automatic, task-aware benchmark for diffusion-based, text-guided image editing. It addresses the limitations of CLIP- and FID-centric evaluations by combining object detection with SAM segmentation masks to assess edit quality, subject preservation, and background fidelity. Its pipeline covers nine edit types with continuous scoring and multiplicity handling, and adds a disentanglement analysis to examine how latent representations map to edits. The framework aligns more closely with human judgments than prior benchmarks (e.g., EditVAL) and shows that spatial edits remain challenging across models, while disentanglement metrics correlate with editing performance for several state-of-the-art methods. By preprocessing external datasets (MagicBrush) and providing detailed supplementary material, PixLens offers a scalable, component-wise approach that lets developers diagnose and improve editing models in a realistic, post-edit evaluation setting.
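To make the idea of mask-based background fidelity concrete, here is a minimal sketch of how a segmentation mask could be used to score how well an edit leaves the background untouched. It is an illustration under stated assumptions, not the metric PixLens actually uses: the function name `background_preservation`, the grayscale nested-list image format, and the mean-absolute-difference formula are all hypothetical choices made for this example.

```python
from typing import Sequence


def background_preservation(
    original: Sequence[Sequence[float]],
    edited: Sequence[Sequence[float]],
    subject_mask: Sequence[Sequence[bool]],
    max_value: float = 255.0,
) -> float:
    """Hypothetical score in [0, 1]; 1.0 means the background is unchanged.

    Assumptions (not taken from the paper):
    - images are grayscale grids of equal shape with values in [0, max_value]
    - subject_mask is True where the edited subject lies (e.g. a segmentation mask)
    - similarity is 1 minus the normalized mean absolute pixel difference
    """
    total_diff = 0.0
    background_pixels = 0
    for row_orig, row_edit, row_mask in zip(original, edited, subject_mask):
        for p_orig, p_edit, is_subject in zip(row_orig, row_edit, row_mask):
            if not is_subject:  # only pixels outside the subject count
                total_diff += abs(p_orig - p_edit)
                background_pixels += 1
    if background_pixels == 0:
        return 1.0  # no background to compare, so nothing was damaged
    return 1.0 - total_diff / (background_pixels * max_value)


# The edit changes only the masked (subject) pixel, so the background is intact.
original = [[10, 10], [10, 200]]
edited = [[10, 10], [10, 50]]
mask = [[False, False], [False, True]]
score = background_preservation(original, edited, mask)
```

In this toy case only the masked pixel differs, so the score stays at its maximum. A real pipeline would obtain the mask from a detection and segmentation model and would likely use a perceptual similarity measure rather than raw pixel differences.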

Abstract

Evaluating diffusion-based image-editing models is a crucial task in the field of Generative AI. Specifically, it is imperative to assess their capacity to execute diverse editing tasks while preserving the image content and realism. While recent developments in generative models have opened up previously unheard-of possibilities for image editing, conducting a thorough evaluation of these models remains a challenging and open task. The absence of a standardized evaluation benchmark, primarily due to the inherent need for a post-edit reference image for evaluation, further complicates this issue. Currently, evaluations often rely on established models such as CLIP or require human intervention for a comprehensive understanding of the performance of these image editing models. Our benchmark, PixLens, provides a comprehensive evaluation of both edit quality and latent representation disentanglement, contributing to the advancement and refinement of existing methodologies in the field.

Paper Structure (37 sections, 1 equation, 22 figures, 6 tables, 8 algorithms)


Introduction
Related Work
PixLens: Benchmark for Text-Guided Image Editing
Edit Quality Automatic Evaluation
Evaluated Edit Types and Pipeline Structure.
Detection and Segmentation Integration.
Subject Preservation.
Background Preservation.
Disentanglement Evaluation
Setup and Disentanglement Pipeline.
Computing Scores.
Automated Evaluation of State-of-the-Art Models
Performance Overview and Analysis
Disentanglement Analysis and Correlation
Quantitative Analysis of the Benchmark
...and 22 more sections

Figures (22)

Figure 1: PixLens Edit Evaluation Pipeline: SIZE operation evaluation example.
Figure 2: Original image (left) and edited images resulting from "changing the color of the boat to black (center) / yellow (right)", using InstructPix2Pix.
Figure 3: Good subject preservation example for an InstructPix2Pix edit with the prompt "Change the color of the dog to red". The subject of this edit is the dog.
Figure 4: Bad subject preservation example for an LCM edit with the prompt "Change the color of the backpack to orange". The subject of this edit is the backpack.
Figure 5: Intra-attribute disentanglement example
...and 17 more figures
