
DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model

Shibo Hong, Boxian Ai, Jun Kuang, Wei Wang, FengJiao Chen, Zhongyuan Peng, Chenhao Huang, Yixin Cao

TL;DR

This paper introduces DeepLookEditBench (DLEBench), the first benchmark dedicated to assessing the ability of Instruction-based Image Editing Models (IIEMs) to edit small-scale objects, and proposes an evaluation protocol with refined score rubrics to minimize subjectivity and ambiguity in two criteria: Instruction Following and Visual Consistency.

Abstract

Significant progress has been made in the field of Instruction-based Image Editing Models (IIEMs). However, while these models demonstrate plausible adherence to instructions and strong reasoning ability on current benchmarks, their ability to edit small objects remains underexplored, despite its importance for precise local editing and for refining details in both real and generated images. In this paper, we introduce DeepLookEditBench (DLEBench), the first benchmark dedicated to assessing the abilities of IIEMs in editing small-scale objects. Specifically, we construct a challenging testbed comprising 1889 samples across seven instruction types. In these samples, target objects occupy only 1%-10% of the image area, and the samples cover complex scenarios such as partial occlusion and multi-object editing. To ensure robust evaluation on this benchmark, we propose an evaluation protocol with refined score rubrics to minimize subjectivity and ambiguity in two criteria: Instruction Following and Visual Consistency. This protocol also introduces a dual-mode evaluation framework (Tool-driven and Oracle-guided Modes) to address the misalignment between LMM-as-a-Judge and human judgements on DLEBench. Empirical results on 10 IIEMs reveal significant performance gaps in small-scale object editing, highlighting the need for specialized benchmarks to advance this ability.

Paper Structure

This paper contains 25 sections, 2 equations, 19 figures, 31 tables.

Figures (19)

  • Figure 1: An example illustrating the challenge of small-scale editing. Given the instruction to edit the green scarf, Gemini-3-Pro misidentifies the target and modifies the foreground object instead.
  • Figure 2: Overview of the three-stage data transformation pipeline. We begin by selecting a raw visual reasoning sample from V*-Bench, specifically one that inquires about the color of a woman's scarf. In Stage 1, we employ a counterfactual synthesis strategy via GPT-4.1 to generate the image-edit metadata (as shown in the bottom red box). Subsequently, Stage 2 utilizes a crop-and-edit strategy with Gemini-3-Pro to generate the reference image. Finally, all data is submitted to Stage 3 for rigorous human verification.
  • Figure 2: Inter-annotator agreement (IAA) on DLEBench. We report Krippendorff's Alpha ($\alpha$) for both Instruction Following (IF) and Visual Consistency (VC) across different instruction types.
  • Figure 3: Comparison of the Cumulative Distribution Function (CDF) of target object area ratios across different instruction-based image editing benchmarks.
  • Figure 4: Distribution of target object area ratios in DLEBench. The bars represent the number of samples in different area intervals, with specific counts annotated above each bar.
  • ...and 14 more figures
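
The area-ratio statistics referenced in the captions for Figures 3 and 4 can be illustrated with a minimal sketch. It assumes the target object is described by an axis-aligned bounding box; the paper summary does not state whether boxes or masks are used, so this choice, along with the helper names `area_ratio`, `in_small_scale_range`, and `empirical_cdf`, is hypothetical and not taken from DLEBench.

```python
import numpy as np


def area_ratio(box, image_size):
    """Fraction of the image covered by a box (x0, y0, x1, y1).

    Assumption: the target is given as an axis-aligned bounding box in pixels.
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    box_area = max(0, x1 - x0) * max(0, y1 - y0)
    return box_area / (width * height)


def in_small_scale_range(ratio, low=0.01, high=0.10):
    """Check whether a ratio lies in the 1%-10% range quoted in the abstract."""
    return low <= ratio <= high


def empirical_cdf(ratios):
    """Return sorted ratios and the cumulative fraction of samples at each one."""
    xs = np.sort(np.asarray(ratios, dtype=float))
    ps = np.arange(1, len(xs) + 1) / len(xs)
    return xs, ps


if __name__ == "__main__":
    # Illustrative boxes on a 1000x500 image; these are not benchmark data.
    boxes = [(0, 0, 100, 50), (10, 10, 210, 135), (0, 0, 250, 200)]
    ratios = [area_ratio(b, (1000, 500)) for b in boxes]
    xs, ps = empirical_cdf(ratios)
    for x, p in zip(xs, ps):
        print(f"ratio <= {x:.3f}: {p:.2f} of samples")
```

Plotting `xs` against `ps` for several benchmarks on shared axes would give a comparison of the kind the Figure 3 caption describes, while binning `ratios` into intervals would give a histogram like the one in the Figure 4 caption.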