What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models

Lorenzo Baraldi, Davide Bucciarelli, Federico Betti, Marcella Cornia, Lorenzo Baraldi, Nicu Sebe, Rita Cucchiara

TL;DR

The paper tackles the challenge of evaluating instruction-guided image edits by introducing DICE, a two-stage framework that first detects object-level differences between an original and edited image and then assesses the coherence of each modification with the user instruction. Built on autoregressive Multimodal LLMs, DICE employs self-supervised pretraining on similar image pairs and inpainting-based distillation to learn robust difference detection, followed by coherence estimation with textual rationale. Through extensive experiments and a dedicated dataset (DICE-D), the approach demonstrates strong alignment with human judgments and improves the reliability of CLIP-based metrics when filtering coherent versus non-coherent edits. The work contributes an interpretable, open-source evaluation pipeline that can rank editing models and guide development toward more faithful and explainable instruction-based edits.
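The two-stage flow described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. The names `Difference`, `Verdict`, `evaluate_edit`, and `coherent_fraction` are hypothetical, and the `toy_detect` and `toy_judge` functions are trivial stand-ins for the fine-tuned MLLM stages. The fraction-of-coherent-edits aggregate is an assumption made for illustration and is not a score taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Difference:
    obj: str        # object involved in the change
    edit_type: str  # illustrative label, e.g. "add", "remove", "replace"


@dataclass
class Verdict:
    difference: Difference
    coherent: bool   # does the change follow from the instruction?
    rationale: str   # short textual explanation


# Hypothetical stage interfaces: in DICE both stages are MLLM-based.
Detector = Callable[[str, str], List[Difference]]
Judge = Callable[[str, str, str, Difference], Verdict]


def evaluate_edit(original: str, edited: str, instruction: str,
                  detect: Detector, judge: Judge) -> List[Verdict]:
    """Stage 1: detect differences. Stage 2: judge each against the instruction."""
    differences = detect(original, edited)
    return [judge(original, edited, instruction, d) for d in differences]


def coherent_fraction(verdicts: List[Verdict]) -> float:
    """Illustrative aggregate (an assumption): share of coherent differences."""
    if not verdicts:
        return 0.0
    return sum(v.coherent for v in verdicts) / len(verdicts)


# Toy stand-ins so the sketch runs without any model.
def toy_detect(original: str, edited: str) -> List[Difference]:
    return [Difference("dog", "replace"), Difference("sky", "modify")]


def toy_judge(original: str, edited: str, instruction: str,
              diff: Difference) -> Verdict:
    # Naive rule: a change is coherent if the instruction mentions the object.
    ok = diff.obj in instruction.lower()
    why = (f"'{diff.obj}' is mentioned in the instruction" if ok
           else f"'{diff.obj}' is not mentioned in the instruction")
    return Verdict(diff, ok, why)


verdicts = evaluate_edit("original.png", "edited.png",
                         "Replace the dog with a cat",
                         toy_detect, toy_judge)
for v in verdicts:
    print(v.difference.obj, v.difference.edit_type, v.coherent, v.rationale)
print(coherent_fraction(verdicts))
```

Keeping the detector and the judge as separate callables mirrors the split between difference detection and coherence estimation, so either stage could be swapped for a real model without touching the other.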

Abstract

Instruction-based image editing models offer increased personalization opportunities in generative tasks. However, properly evaluating their results is challenging, and most existing metrics lag in terms of alignment with human judgment and explainability. To tackle these issues, we introduce DICE (DIfference Coherence Estimator), a model designed to detect localized differences between the original and the edited image and to assess their relevance to the given modification request. DICE consists of two key components: a difference detector and a coherence estimator, both built on an autoregressive Multimodal Large Language Model (MLLM) and trained using a strategy that leverages self-supervision, distillation from inpainting networks, and full supervision. Through extensive experiments, we evaluate each stage of our pipeline, comparing different MLLMs within the proposed framework. We demonstrate that DICE effectively identifies coherent edits and evaluates images generated by different editing models with a strong correlation with human judgment. We publicly release our source code, models, and data.

Paper Structure

This paper contains 17 sections, 3 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Qualitative examples from DICE. Our approach detects differences between an original image and an edited one, identifying the involved objects and the type of edit. Further, DICE evaluates each difference to determine its coherence with the editing prompt.
  • Figure 2: Illustration of DICE. We employ an MLLM and fine-tune it for two different tasks. In the first stage (difference detection), the MLLM is trained to detect semantic differences between the original image and the edited one. In the second stage (coherence estimation), the MLLM is instructed to analyze and assess the coherence of each detected difference with respect to the given user prompt.
  • Figure 3: Qualitative samples of DICE applied to images edited by the MGIE (Fu et al., 2024) and InstructDiffusion (Geng et al., 2024) models.
  • Figure 4: User study interface displaying the original and the edited image alongside the editing prompt.
  • Figure 5: Additional qualitative results. Each instruction-based edit shows the original image (left) and the edited version (right), alongside the given prompt.
  • ...and 2 more figures