MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models
Hang Hua, Ziyun Zeng, Yizhi Song, Yunlong Tang, Liu He, Daniel Aliaga, Wei Xiong, Jiebo Luo
TL;DR
MMIG-Bench addresses the fragmented evaluation of multi-modal image generation by unifying text- and image-conditioned prompts into a single benchmark. It pairs 1,750 multi-view reference images across 380 subjects with 4,850 prompts and employs a three-level evaluation framework: low-level artifact and identity checks, a novel mid-level Aspect Matching Score based on VQA, and high-level aesthetics measures. The paper introduces data-curation pipelines (including GPT-4o prompting and FineMatch labeling) and demonstrates strong alignment with human judgments (AMS correlates with human ratings at ρ ≈ 0.699) across 17 modern models, providing actionable insights for architecture and data design. MMIG-Bench aims to accelerate research on multimodal generation by releasing data, code, and a leaderboard, enabling comprehensive and interpretable comparisons beyond traditional T2I metrics.
Abstract
Recent multimodal image generators such as GPT-4o, Gemini 2.0 Flash, and Gemini 2.5 Pro excel at following complex instructions, editing images, and maintaining concept consistency. However, they are still evaluated with disjoint toolkits: text-to-image (T2I) benchmarks that lack multi-modal conditioning, and customized image generation benchmarks that overlook compositional semantics and common knowledge. We propose MMIG-Bench, a comprehensive Multi-Modal Image Generation Benchmark that unifies these tasks by pairing 4,850 richly annotated text prompts with 1,750 multi-view reference images across 380 subjects, spanning humans, animals, objects, and artistic styles. MMIG-Bench is equipped with a three-level evaluation framework: (1) low-level metrics for visual artifacts and identity preservation of objects; (2) a novel Aspect Matching Score (AMS), a VQA-based mid-level metric that delivers fine-grained prompt-image alignment and shows strong correlation with human judgments; and (3) high-level metrics for aesthetics and human preference. Using MMIG-Bench, we benchmark 17 state-of-the-art models, including Gemini 2.5 Pro, FLUX, DreamBooth, and IP-Adapter, and validate our metrics with 32k human ratings, yielding in-depth insights into architecture and data design.
