UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits

Keming Ye, Zhipeng Huang, Canmiao Fu, Qingyang Liu, Jiani Cai, Zheqi Lv, Chen Li, Jing Lyu, Zhou Zhao, Shengyu Zhang

TL;DR

UnicEdit-10M introduces a 10M-scale dataset for image editing that spans basic and complex tasks using an end-to-end editing pipeline with unified post-verification. It pairs with UnicBench, a diagnostic benchmark with novel metrics (Non-edit Consistency, Reasoning Accuracy) to evaluate instruction following, non-target changes, visual quality, and reasoning fidelity. A compact 7B expert, Qwen-Verify, performs failure detection and instruction recaptioning to enable scalable data curation without costly APIs. The results reveal limitations in current models' reasoning and spatial editing, especially among open-source approaches, and provide a pathway for targeted data and evaluation improvements to close the gap with closed-source systems.

Abstract

With the rapid advances of powerful multimodal models such as GPT-4o, Nano Banana, and Seedream 4.0 in image editing, the performance gap between closed-source and open-source models is widening, primarily due to the scarcity of large-scale, high-quality training data and comprehensive benchmarks capable of diagnosing model weaknesses across diverse editing behaviors. Existing data construction methods face a scale-quality trade-off: human annotations are high-quality but not scalable, while automated pipelines suffer from error propagation and noise. To address this, we introduce a lightweight data pipeline that replaces multi-toolchains with an end-to-end model and a unified post-verification stage. For scalable quality control, we train a 7B dual-task expert model, Qwen-Verify, for efficient failure detection and instruction recaptioning. This pipeline yields UnicEdit-10M, a 10M-scale dataset spanning diverse basic and complex editing tasks. We also propose UnicBench, a general benchmark that extends beyond basic edits to explicitly assess spatial and knowledge-driven reasoning. To enable fine-grained diagnosis, we introduce novel metrics, including Non-edit Consistency and Reasoning Accuracy. Our analysis of mainstream models on UnicBench reveals their limitations and provides clear directions for future research.
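The pipeline described above (an end-to-end editor followed by one unified verification pass that both filters failures and recaptions instructions) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `Triplet`, `Verdict`, `curate`, `toy_edit`, and `toy_verify` are hypothetical, images are represented as plain strings, and the stand-in functions do not reflect how the actual editing model or Qwen-Verify behave.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Triplet:
    """Hypothetical container for (source image, instruction, edited image)."""
    source_image: str   # placeholder: a path or ID, not real pixel data
    instruction: str
    edited_image: str


@dataclass
class Verdict:
    """Hypothetical output of the dual-task verifier."""
    passed: bool                               # failure-detection result
    refined_instruction: Optional[str] = None  # recaptioning result, if any


def curate(
    samples: Iterable[Tuple[str, str]],
    edit: Callable[[str, str], str],
    verify: Callable[[Triplet], Verdict],
) -> List[Triplet]:
    """Run edit -> verify on each sample and keep only verified triplets."""
    kept: List[Triplet] = []
    for source_image, instruction in samples:
        edited = edit(source_image, instruction)       # image editing stage
        triplet = Triplet(source_image, instruction, edited)
        verdict = verify(triplet)                      # post-verification stage
        if not verdict.passed:
            continue                                   # drop failed edits
        if verdict.refined_instruction:
            triplet.instruction = verdict.refined_instruction  # recaption
        kept.append(triplet)
    return kept


# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
def toy_edit(image: str, instruction: str) -> str:
    return f"{image}+edited"


def toy_verify(triplet: Triplet) -> Verdict:
    if "fail" in triplet.instruction:
        return Verdict(passed=False)
    cleaned = triplet.instruction.strip().capitalize()
    return Verdict(passed=True, refined_instruction=cleaned)


result = curate(
    [("a.png", " make the sky red "), ("b.png", "fail this edit")],
    toy_edit,
    toy_verify,
)
```

In this sketch the verifier is the single point where quality decisions are made, which mirrors the idea of replacing per-tool checks with one post-verification stage; swapping in a real editor or verifier would only require passing different callables.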

Paper Structure

This paper contains 38 sections, 6 equations, 15 figures, 12 tables.

Figures (15)

  • Figure 1: UnicEdit-10M covers 22 edit tasks spanning basic and complex edits, with a unified post-verification stage that filters failures and refines instructions to yield high-quality triplets. We also introduce UnicBench with fine-grained metrics for comprehensive evaluation.
  • Figure 2: Representative examples of all sub-tasks from UnicEdit-10M.
  • Figure 3: Data curation pipeline with three stages: (1) data preparation, (2) image editing, and (3) post-verification, which filters out failed edits and recaptions instructions.
  • Figure 4: Post-verification examples of the expert model. Base denotes Qwen2.5-VL-7B; SFT denotes Base model after Stage-1 SFT; Ours denotes the dual-task expert model Qwen-Verify.
  • Figure 5: Qualitative comparison of data curation pipelines. (a) shows example triplets from ImgEdit [ye2025imgedit]. (b) shows example triplets from SEED-Data-Edit [ge2024seeddataedit]. In each subfigure, the left columns display the original triplets, while the right columns show the same images reprocessed through our pipeline. Red boxes highlight regions of blurring, artifacts, and color inconsistencies in the originals. Our pipeline consistently yields higher-quality edits and more precisely aligned instructions, demonstrating the effectiveness of our unified verification process.
  • ...and 10 more figures