Human-Aligned MLLM Judges for Fine-Grained Image Editing Evaluation: A Benchmark, Framework, and Analysis

Runzhou Liu, Hailey Weingord, Sejal Mittal, Prakhar Dungarwal, Anusha Nandula, Bo Ni, Samyadeep Basu, Hongjie Chen, Nesreen K. Ahmed, Li Li, Jiayi Zhang, Koustava Goswami, Subhojyoti Mukherjee, Branislav Kveton, Puneet Mathur, Franck Dernoncourt, Yue Zhao, Yu Wang, Ryan A. Rossi, Zhengzhong Tu, Hongru Du

TL;DR

The paper demonstrates that traditional image editing metrics are often poor proxies for fine-grained evaluation factors covering image preservation, edit quality, and instruction fidelity. These metrics fail to distinguish over-edited or semantically imprecise outputs, whereas the proposed MLLM judges provide more intuitive and informative assessments in both offline and online settings.

Abstract

Evaluating image editing models remains challenging due to the coarse granularity and limited interpretability of traditional metrics, which often fail to capture aspects important to human perception and intent. Such metrics frequently reward visually plausible outputs while overlooking controllability, edit localization, and faithfulness to user instructions. In this work, we introduce a fine-grained Multimodal Large Language Model (MLLM)-as-a-Judge framework for image editing that decomposes common evaluation notions into twelve fine-grained interpretable factors spanning image preservation, edit quality, and instruction fidelity. Building on this formulation, we present a new human-validated benchmark that integrates human judgments, MLLM-based evaluations, model outputs, and traditional metrics across diverse image editing tasks. Through extensive human studies, we show that the proposed MLLM judges align closely with human evaluations at a fine granularity, supporting their use as reliable and scalable evaluators. We further demonstrate that traditional image editing metrics are often poor proxies for these factors, failing to distinguish over-edited or semantically imprecise outputs, whereas our judges provide more intuitive and informative assessments in both offline and online settings. Together, this work introduces a benchmark, a principled factorization, and empirical evidence positioning fine-grained MLLM judges as a practical foundation for studying, comparing, and improving image editing approaches.
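The abstract's claim that the judges "align closely with human evaluations at a fine granularity" can be made concrete with a per-factor agreement score. The sketch below is illustrative only: it assumes Spearman rank correlation as the agreement measure, which the text above does not confirm, and the factor names and scores are hypothetical placeholders rather than benchmark data.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def per_factor_alignment(human, judge):
    """Map each factor present in both inputs to its human-judge correlation."""
    return {f: spearman(human[f], judge[f]) for f in human if f in judge}


# Hypothetical Likert ratings (1-5) for five edited images per factor.
human_scores = {
    "instruction_fidelity": [5, 4, 2, 1, 3],
    "background_preservation": [4, 4, 3, 2, 5],
}
judge_scores = {
    "instruction_fidelity": [5, 3, 2, 1, 4],
    "background_preservation": [5, 4, 3, 1, 4],
}

alignment = per_factor_alignment(human_scores, judge_scores)
for factor, rho in alignment.items():
    print(f"{factor}: rho = {rho:.2f}")
```

Reporting one correlation per factor, rather than a single pooled number, mirrors the fine-grained framing: a judge may track humans well on one factor and poorly on another, and a pooled score would hide that.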

Paper Structure (88 sections, 8 equations, 9 figures, 31 tables)

Figures (9)

  • Figure 1: Motivation for fine-grained MLLM-based evaluation. The same image-editing example is assessed using two approaches: (left) traditional metrics, which collapse diverse editing behaviors into a single, potentially misleading score, and (right) our MLLM judge, which decomposes the edit into interpretable factors that explicitly explain why the edit succeeds or fails. This decomposition makes the motivation, methodology, and benefits of our approach immediately apparent.
  • Figure 2: Overview of the proposed factors used in our MLLM-as-a-Judge for image editing. Results are shown for each factor using both poorly edited and well-edited images (the implementation is given in the MLLM-as-a-Judge prompt figure).
  • Figure 3: Human evaluation study interface illustrating an example image-editing task. Evaluators are shown the original image, the editing instruction, and the edited (API-generated) image, and are asked to rate multiple dimensions on Likert scales. These annotations form the benchmark human evaluation dataset used in our analysis.
  • Figure 4: Similarity-based metrics reward global appearance and identity preservation, while fine-grained MLLM judges evaluate instruction fidelity and local edit quality, correctly diagnosing semantic edit failures.
  • Figure 5: Human Evaluation Study UI for Instruction-Guided Image Editing
  • ...and 4 more figures