KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models

Yongliang Wu, Zonghui Li, Xinting Hu, Xinyu Ye, Xianfang Zeng, Gang Yu, Wenbo Zhu, Bernt Schiele, Ming-Hsuan Yang, Xu Yang

TL;DR

KRIS-Bench introduces a cognitively grounded benchmark for instruction-based image editing that assesses knowledge-based reasoning via a three-type knowledge taxonomy (Factual, Conceptual, Procedural), 22 tasks, and 1,267 annotated instances. It adds a Knowledge Plausibility metric with knowledge hints and human validation, enabling more reliable evaluation of real-world knowledge integration. Across 10 state-of-the-art systems, results reveal substantial gaps in knowledge-grounded editing, with procedural and domain-specific reasoning proving especially challenging. The benchmark provides a principled framework for advancing knowledge-centric image editing and highlights directions for future research in cognitively aligned, reasoning-aware editing systems.

Abstract

Recent advances in multi-modal generative models have enabled significant progress in instruction-based image editing. However, while these models produce visually plausible outputs, their capacity for editing tasks that require knowledge-based reasoning remains under-explored. In this paper, we introduce KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark), a diagnostic benchmark designed to assess models through a cognitively informed lens. Drawing from educational theory, KRIS-Bench categorizes editing tasks across three foundational knowledge types: Factual, Conceptual, and Procedural. Based on this taxonomy, we design 22 representative tasks spanning 7 reasoning dimensions and release 1,267 high-quality annotated editing instances. To support fine-grained evaluation, we propose a comprehensive protocol that incorporates a novel Knowledge Plausibility metric, enhanced by knowledge hints and calibrated through human studies. Empirical results on 10 state-of-the-art models reveal significant gaps in reasoning performance, highlighting the need for knowledge-centric benchmarks to advance the development of intelligent image editing systems.

Paper Structure

This paper contains 23 sections, 36 figures, 3 tables.

Figures (36)

  • Figure 1: (a) We present KRIS-Bench, a benchmark for instruction-based image editing grounded in a knowledge-based reasoning taxonomy. It covers 3 knowledge dimensions, 7 reasoning dimensions, and 22 editing tasks. Specific examples are shown in Figure 2. (b) Given an editing pair of (image, instruction) under a specific reasoning dimension (e.g., Chemistry in Natural Science), we evaluate the output of image editing models with automated VLM tools over the four proposed complementary metrics, which are aligned with human scoring.
  • Figure 2: Representative examples from the 22 knowledge-based reasoning image editing tasks in KRIS-Bench. Each task is designed to evaluate a specific kind of factual, conceptual, or procedural knowledge, covering diverse reasoning dimensions.
  • Figure 3: Visualization results of (a) Color Change, (b) Position Movement, (c) Humanities, (d) Chemistry, and (e) Abstract Reasoning across different models and metrics. Each example is provided with scores across the four evaluation metrics as well as an overall average score. Note that the knowledge hint is provided solely for evaluation and has been shortened for better illustration.
  • Figure 4: Performance on KRIS-Bench across different editing tasks and four different metrics. Top: closed-source models. Bottom: open-source models.
  • Figure 5: Correlation between human and VLM scores across Visual Consistency (VC), Visual Quality (VQ), Instruction Following (IF), and Knowledge Plausibility (KP). We compare the prompts incorporating knowledge hints (Knowledge Prompts) with a simple baseline (Simple Prompts).
  • ...and 31 more figures
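
The figure captions mention two computations: an overall average over the four metrics (Figure 3) and a correlation between human and VLM scores (Figure 5). The sketch below illustrates both under stated assumptions. The class name `MetricScores`, the helper `pearson`, the unweighted mean, the choice of Pearson correlation, and all score values are hypothetical; the summary does not specify the score scale, the weighting, or the correlation measure the authors use.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class MetricScores:
    """Scores for one edited image on the four metrics named in Figure 5.

    Field names are illustrative; the score scale is an assumption.
    """
    vc: float   # Visual Consistency
    vq: float   # Visual Quality
    if_: float  # Instruction Following ("if" is a Python keyword)
    kp: float   # Knowledge Plausibility

    def overall(self) -> float:
        # Assumption: the overall score is an unweighted mean of the four metrics.
        return (self.vc + self.vq + self.if_ + self.kp) / 4


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equally long score lists.

    Assumption: Pearson is used here only for illustration; the source
    does not state which correlation measure is reported.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder values, not results from the paper.
    sample = MetricScores(vc=4.0, vq=5.0, if_=3.0, kp=2.0)
    print("overall:", sample.overall())

    human_scores = [1.0, 2.0, 4.0, 5.0]
    vlm_scores = [1.5, 2.5, 3.5, 5.0]
    print("human/VLM correlation:", pearson(human_scores, vlm_scores))
```

Comparing such a correlation for two prompt variants (with and without knowledge hints) would mirror the comparison described in the Figure 5 caption, although the exact procedure there may differ.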