ICE-Bench: A Unified and Comprehensive Benchmark for Image Creating and Editing

Yulin Pan; Xiangteng He; Chaojie Mao; Zhen Han; Zeyinzi Jiang; Jingfeng Zhang; Yu Liu

TL;DR

ICE-Bench addresses the fragmented landscape of image-generation evaluation by proposing a unified benchmark that covers 31 fine-grained creating and editing tasks, organized into coarse-to-fine categories. It combines six evaluation dimensions with eleven metrics, including a novel VLLM-QA metric, and evaluates ten models on a 6,538-instance hybrid dataset that mixes real and synthetic data. The study reveals broad gaps in current models’ generality and highlights how model capacity, data quality, and evaluation design influence imaging fidelity and instruction adherence. By open-sourcing its data, code, and models, ICE-Bench aims to standardize evaluation and accelerate progress toward versatile, real-world image creation and editing systems.

Abstract

Image generation has witnessed significant advancements in the past few years. However, evaluating the performance of image generation models remains a formidable challenge. In this paper, we propose ICE-Bench, a unified and comprehensive benchmark designed to rigorously assess image generation models. Its comprehensiveness can be summarized in three key features: (1) Coarse-to-Fine Tasks: We systematically deconstruct image generation into four task categories (No-ref Image Creating, Ref Image Creating, No-ref Image Editing, and Ref Image Editing) based on the presence or absence of source images and reference images, and further decompose them into 31 fine-grained tasks covering a broad spectrum of image generation requirements. (2) Multi-dimensional Metrics: The evaluation framework assesses image generation capabilities across six dimensions: aesthetic quality, imaging quality, prompt following, source consistency, reference consistency, and controllability. Eleven metrics are introduced to support this multi-dimensional evaluation. Notably, we introduce VLLM-QA, an innovative metric that leverages large models to assess whether an image edit succeeded. (3) Hybrid Data: The data comes from both real scenes and synthetic generation, which improves data diversity and mitigates bias in model evaluation. Through ICE-Bench, we conduct a thorough analysis of existing generation models, revealing both the challenging nature of our benchmark and the gap between current model capabilities and real-world generation requirements. To foster further advancements in the field, we will open-source ICE-Bench, including its dataset, evaluation code, and models, thereby providing a valuable resource for the research community.
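The coarse level of the task taxonomy can be illustrated with a minimal sketch. It assumes that "Editing" tasks are those that take a source image and "Ref" tasks are those that take a reference image, which is one reading of the abstract rather than the benchmark's released code. The names `TaskCategory`, `categorize`, and `DIMENSIONS` are hypothetical and chosen only for illustration.

```python
from enum import Enum


class TaskCategory(Enum):
    """The four coarse task categories named in the abstract."""
    NO_REF_CREATING = "No-ref Image Creating"
    REF_CREATING = "Ref Image Creating"
    NO_REF_EDITING = "No-ref Image Editing"
    REF_EDITING = "Ref Image Editing"


# The six evaluation dimensions listed in the abstract.
DIMENSIONS = (
    "aesthetic quality",
    "imaging quality",
    "prompt following",
    "source consistency",
    "reference consistency",
    "controllability",
)


def categorize(has_source: bool, has_reference: bool) -> TaskCategory:
    """Map the presence of source/reference images to a coarse category.

    Assumption: a source image implies editing, and a reference image
    implies a "Ref" task. This mapping is illustrative only.
    """
    if has_source:
        return TaskCategory.REF_EDITING if has_reference else TaskCategory.NO_REF_EDITING
    return TaskCategory.REF_CREATING if has_reference else TaskCategory.NO_REF_CREATING


if __name__ == "__main__":
    # Plain text-to-image generation has neither a source nor a reference image.
    print(categorize(has_source=False, has_reference=False).value)
```

Under this reading, each of the 31 fine-grained tasks would fall under exactly one of these four categories, and only the dimensions that apply to that category (for example, source consistency for editing tasks) would be scored.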
