GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing

Mingxin Liu, Ziqian Fan, Zhaokai Wang, Leyao Gu, Zirun Zhu, Yiguo He, Yuchen Yang, Changyao Tian, Xiangyu Zhao, Ning Liao, Shaofeng Zhang, Qibing Ren, Zhihang Zhong, Xuanhe Zhou, Junchi Yan, Xue Yang

Abstract

Unified multimodal models target joint understanding, reasoning, and generation, but current image editing benchmarks are largely confined to natural images and shallow commonsense reasoning, offering limited assessment of this capability under structured, domain-specific constraints. In this work, we introduce GRADE, the first benchmark to assess discipline-informed knowledge and reasoning in image editing. GRADE comprises 520 carefully curated samples across 10 academic domains, ranging from natural science to social science. To support rigorous evaluation, we propose a multi-dimensional evaluation protocol that jointly assesses Discipline Reasoning, Visual Consistency, and Logical Readability. Extensive experiments on 20 state-of-the-art open-source and closed-source models reveal substantial limitations under implicit, knowledge-intensive editing settings, along with large performance gaps across models. Beyond quantitative scores, we conduct rigorous analyses and ablations to expose model shortcomings and identify the constraints within disciplinary editing. Together, these results pinpoint key directions for the future development of unified multimodal models and advance research on discipline-informed image editing and reasoning. Our benchmark and evaluation code are publicly released.

Paper Structure (26 sections, 23 figures, 10 tables)

Figures (23)

  • Figure 1: Examples of images generated by state-of-the-art editing models on GRADE and their evaluation results. Challenging discipline-informed image editing exposes limitations of current models in complex knowledge reasoning. Notable performance gaps exist across models on GRADE.
  • Figure 2: Overview of GRADE. GRADE contains 520 discipline-informed image editing samples across ten academic disciplines.
  • Figure 3: Evaluation pipeline. We evaluate edited results on (A) Discipline Reasoning via weighted, question-guided MLLM judging, (B) Visual Consistency with task-specific prompts (localized/style/independence), and (C) Logical Readability for clarity and text/annotation correctness.
  • Figure 4: Qualitative comparison of six representative high-performing models.
  • Figure 5: Error analysis of Nano Banana Pro. GT: Ground truth image. Error types include: (a) Image recognition error: mis-parsed structural cues. (b) Knowledge error: failure to activate domain-specific priors produces semantically invalid elements. (c) Reasoning process error: correct methodology but flawed multi-step execution violates constraints. (d) Generation process error: correct planning but failure to enforce hard constraints during synthesis.
  • ...and 18 more figures
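
The Figure 3 caption describes Discipline Reasoning as weighted, question-guided MLLM judging but does not give the aggregation rule here. One plausible reading is a weight-normalized pass rate over per-sample checklist questions, sketched below. The `JudgedQuestion` type, the `weighted_reasoning_score` function, the weights, and the example questions are all hypothetical illustrations, not the benchmark's released code, and the binary `passed` verdict stands in for whatever the MLLM judge actually returns.

```python
from dataclasses import dataclass


@dataclass
class JudgedQuestion:
    """One checklist question with an assumed weight and the judge's verdict."""
    text: str
    weight: float   # hypothetical importance of this question
    passed: bool    # assumed binary verdict from an MLLM judge


def weighted_reasoning_score(questions: list[JudgedQuestion]) -> float:
    """Return the weight-normalized fraction of passed questions, in [0, 1]."""
    total = sum(q.weight for q in questions)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(q.weight for q in questions if q.passed) / total


# Hypothetical checklist for a single edited image.
example = [
    JudgedQuestion("Is the main disciplinary constraint satisfied?", 3.0, True),
    JudgedQuestion("Are secondary annotations correct?", 1.0, False),
]
score = weighted_reasoning_score(example)  # 3.0 / 4.0
```

Under this assumption, a heavily weighted question dominates the score, so an edit that gets the core disciplinary fact right but misses a minor detail still scores well.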