Human-Aligned MLLM Judges for Fine-Grained Image Editing Evaluation: A Benchmark, Framework, and Analysis

Runzhou Liu; Hailey Weingord; Sejal Mittal; Prakhar Dungarwal; Anusha Nandula; Bo Ni; Samyadeep Basu; Hongjie Chen; Nesreen K. Ahmed; Li Li; Jiayi Zhang; Koustava Goswami; Subhojyoti Mukherjee; Branislav Kveton; Puneet Mathur; Franck Dernoncourt; Yue Zhao; Yu Wang; Ryan A. Rossi; Zhengzhong Tu; Hongru Du

TL;DR

Traditional image editing metrics are often poor proxies for fine-grained evaluation factors (image preservation, edit quality, and instruction fidelity) and fail to distinguish over-edited or semantically imprecise outputs, whereas the proposed MLLM judges provide more intuitive and informative assessments in both offline and online settings.

Abstract

Evaluating image editing models remains challenging due to the coarse granularity and limited interpretability of traditional metrics, which often fail to capture aspects important to human perception and intent. Such metrics frequently reward visually plausible outputs while overlooking controllability, edit localization, and faithfulness to user instructions. In this work, we introduce a fine-grained Multimodal Large Language Model (MLLM)-as-a-Judge framework for image editing that decomposes common evaluation notions into twelve fine-grained interpretable factors spanning image preservation, edit quality, and instruction fidelity. Building on this formulation, we present a new human-validated benchmark that integrates human judgments, MLLM-based evaluations, model outputs, and traditional metrics across diverse image editing tasks. Through extensive human studies, we show that the proposed MLLM judges align closely with human evaluations at a fine granularity, supporting their use as reliable and scalable evaluators. We further demonstrate that traditional image editing metrics are often poor proxies for these factors, failing to distinguish over-edited or semantically imprecise outputs, whereas our judges provide more intuitive and informative assessments in both offline and online settings. Together, this work introduces a benchmark, a principled factorization, and empirical evidence positioning fine-grained MLLM judges as a practical foundation for studying, comparing, and improving image editing approaches.
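The alignment claim above can be made concrete with a small sketch. It assumes agreement is measured as a Spearman rank correlation between human Likert ratings and judge scores, computed separately for each group of factors; the abstract does not state which agreement statistic the paper actually uses, so this is only one plausible choice. The function names and all numeric scores below are hypothetical and made up for illustration; only the three group names come from the abstract.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN when either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def agreement_by_group(human, judge):
    """Map each factor group to the human/judge rank correlation."""
    return {group: spearman(human[group], judge[group]) for group in human}


# Hypothetical 1-5 Likert scores for five edited images (illustrative only).
human_scores = {
    "image_preservation": [5, 4, 2, 1, 3],
    "edit_quality": [3, 3, 4, 2, 5],
    "instruction_fidelity": [1, 2, 3, 4, 5],
}
judge_scores = {
    "image_preservation": [5, 3, 2, 1, 4],
    "edit_quality": [3, 4, 4, 1, 5],
    "instruction_fidelity": [2, 1, 3, 5, 4],
}

if __name__ == "__main__":
    for group, rho in agreement_by_group(human_scores, judge_scores).items():
        print(f"{group}: rho = {rho:.3f}")
```

Reporting one correlation per group, rather than a single pooled number, mirrors the fine-grained framing: a judge can track humans well on preservation while diverging on instruction fidelity, and a pooled score would hide that.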

Paper Structure (88 sections, 8 equations, 9 figures, 31 tables)

Introduction
Related Work
Image Editing Methods
Traditional Metrics for Image Editing
Fine-Grained MLLM Judges
Problem Formulation
Our MLLM Judge Factors
Image Preservation
Edit Quality
Instruction Fidelity
Base Models and Implementation Details
Methodology
Benchmark Collection
Participants and Procedure
Recruitment.
...and 73 more sections

Figures (9)

Figure 1: Motivation for fine-grained MLLM-based evaluation. The same image-editing example is assessed using two approaches: (left) traditional metrics, which collapse diverse editing behaviors into a single, potentially misleading score, and (right) our MLLM judge, which decomposes the edit into interpretable factors that explicitly explain why the edit succeeds or fails. This decomposition makes the motivation, methodology, and benefits of our approach immediately apparent.
Figure 2: Overview of the proposed factors used in our MLLM-as-a-Judge for image editing. Results are shown for each factor using both poorly edited and well-edited images (implementation in Fig. \ref{fig:prompt-mllm-as-a-judge-image-editing}).
Figure 3: Human evaluation study interface illustrating an example image-editing task. Evaluators are shown the original image, the editing instruction, and the edited image (the API-generated image), and are asked to rate multiple dimensions using Likert-scale judgments. These annotations form the benchmark human evaluation dataset used in our analysis.
Figure 4: Similarity-based metrics reward global appearance and identity preservation, while fine-grained MLLM judges evaluate instruction fidelity and local edit quality, correctly diagnosing semantic edit failures.
Figure 5: Human Evaluation Study UI for Instruction-Guided Image Editing
...and 4 more figures
