UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits

Keming Ye; Zhipeng Huang; Canmiao Fu; Qingyang Liu; Jiani Cai; Zheqi Lv; Chen Li; Jing Lyu; Zhou Zhao; Shengyu Zhang

TL;DR

UnicEdit-10M introduces a 10M-scale dataset for image editing that spans basic and complex tasks using an end-to-end editing pipeline with unified post-verification. It pairs with UnicBench, a diagnostic benchmark with novel metrics (Non-edit Consistency, Reasoning Accuracy) to evaluate instruction following, non-target changes, visual quality, and reasoning fidelity. A compact 7B expert, Qwen-Verify, performs failure detection and instruction recaptioning to enable scalable data curation without costly APIs. The results reveal limitations in current models' reasoning and spatial editing, especially among open-source approaches, and provide a pathway for targeted data and evaluation improvements to close the gap with closed-source systems.

Abstract

With the rapid advances of powerful multimodal models such as GPT-4o, Nano Banana, and Seedream 4.0 in image editing, the performance gap between closed-source and open-source models is widening, primarily due to the scarcity of large-scale, high-quality training data and comprehensive benchmarks capable of diagnosing model weaknesses across diverse editing behaviors. Existing data construction methods face a scale-quality trade-off: human annotations are high-quality but not scalable, while automated pipelines suffer from error propagation and noise. To address this, we introduce a lightweight data pipeline that replaces multi-tool chains with an end-to-end model and a unified post-verification stage. For scalable quality control, we train a 7B dual-task expert model, Qwen-Verify, for efficient failure detection and instruction recaptioning. This pipeline yields UnicEdit-10M, a 10M-scale dataset spanning diverse basic and complex editing tasks. We also propose UnicBench, a general benchmark that extends beyond basic edits to explicitly assess spatial and knowledge-driven reasoning. To enable fine-grained diagnosis, we introduce novel metrics, including Non-edit Consistency and Reasoning Accuracy. Our analysis of mainstream models on UnicBench reveals their limitations and provides clear directions for future research.
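
The pipeline described in the abstract (an end-to-end editing model followed by one unified post-verification stage that detects failures and recaptions instructions) can be pictured with a minimal sketch. Everything named here is an assumption made for illustration: `EditSample`, `Verdict`, `curate`, `toy_edit`, and `toy_verify` are hypothetical, the stubs operate on strings rather than images, and none of this is the authors' code or the actual Qwen-Verify interface.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class EditSample:
    source_image: str                   # stand-in for image data or a file path
    instruction: str                    # editing instruction in natural language
    edited_image: Optional[str] = None  # filled in once the edit has been produced


@dataclass
class Verdict:
    failed: bool              # failure-detection result (hypothetical field)
    recaption: Optional[str]  # rewritten instruction, or None to keep the original


def curate(
    samples: Iterable[EditSample],
    edit_fn: Callable[[str, str], str],
    verify_fn: Callable[[str, str, str], Verdict],
) -> List[EditSample]:
    """Run one editing model, then one verification pass over each sample."""
    kept: List[EditSample] = []
    for sample in samples:
        # Step 1: a single end-to-end model produces the edited image.
        edited = edit_fn(sample.source_image, sample.instruction)
        # Step 2: the verifier judges the triplet and may rewrite the instruction.
        verdict = verify_fn(sample.source_image, edited, sample.instruction)
        if verdict.failed:
            continue  # drop edits flagged as failures
        instruction = verdict.recaption or sample.instruction
        kept.append(EditSample(sample.source_image, instruction, edited))
    return kept


# Toy stand-ins (assumptions) so the sketch runs without any model.
def toy_edit(image: str, instruction: str) -> str:
    return f"{image}+edited"


def toy_verify(source: str, edited: str, instruction: str) -> Verdict:
    if "impossible" in instruction:
        return Verdict(failed=True, recaption=None)
    return Verdict(failed=False, recaption=instruction.strip().capitalize())
```

In a real setting the two callables would wrap model inference; the point of the sketch is only that filtering and recaptioning happen in one pass after editing, rather than at several points along a chain of tools.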
