MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models

Hang Hua, Ziyun Zeng, Yizhi Song, Yunlong Tang, Liu He, Daniel Aliaga, Wei Xiong, Jiebo Luo

TL;DR

MMIG-Bench addresses the fragmented evaluation of multi-modal image generation by unifying text- and image-conditioned prompts into a single benchmark. It pairs 1,750 multi-view reference images across 380 subjects with 4,850 prompts and employs a three-level evaluation framework: low-level artifact and identity checks, a novel mid-level Aspect Matching Score based on VQA, and high-level aesthetics measures. The paper introduces data-curation pipelines (including GPT-4o prompting and FineMatch labeling) and demonstrates strong alignment with human judgments (AMS correlates with human ratings at ρ ≈ 0.699) across 17 modern models, providing actionable insights for architecture and data design. MMIG-Bench aims to accelerate research on multimodal generation by releasing data, code, and a leaderboard, enabling comprehensive and interpretable comparisons beyond traditional T2I metrics.

Abstract

Recent multimodal image generators such as GPT-4o, Gemini 2.0 Flash, and Gemini 2.5 Pro excel at following complex instructions, editing images, and maintaining concept consistency. However, they are still evaluated by disjoint toolkits: text-to-image (T2I) benchmarks that lack multi-modal conditioning, and customized image generation benchmarks that overlook compositional semantics and common knowledge. We propose MMIG-Bench, a comprehensive Multi-Modal Image Generation Benchmark that unifies these tasks by pairing 4,850 richly annotated text prompts with 1,750 multi-view reference images across 380 subjects, spanning humans, animals, objects, and artistic styles. MMIG-Bench is equipped with a three-level evaluation framework: (1) low-level metrics for visual artifacts and identity preservation of objects; (2) a novel Aspect Matching Score (AMS), a VQA-based mid-level metric that delivers fine-grained prompt-image alignment and shows strong correlation with human judgments; and (3) high-level metrics for aesthetics and human preference. Using MMIG-Bench, we benchmark 17 state-of-the-art models, including Gemini 2.5 Pro, FLUX, DreamBooth, and IP-Adapter, and validate our metrics with 32k human ratings, yielding in-depth insights into architecture and data design.
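To make the idea of a VQA-based alignment metric concrete, the following is a minimal sketch of how an aspect-level matching score could be computed. It is not the paper's AMS definition: the `AspectQuestion` type, the `normalize` helper, the exact-match rule, and the unweighted averaging are all assumptions made for illustration, and the VQA answers are passed in as plain strings rather than produced by any particular model.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class AspectQuestion:
    """Hypothetical record: one question probing one aspect of the prompt."""
    aspect: str    # e.g. "object", "attribute", "relation", "counting"
    question: str  # question asked about the generated image
    expected: str  # answer implied by the text prompt


def normalize(answer: str) -> str:
    """Lower-case and trim an answer so trivial formatting differences match."""
    return answer.strip().lower().rstrip(".")


def aspect_matching_score(
    questions: Sequence[AspectQuestion], vqa_answers: Sequence[str]
) -> float:
    """Fraction of aspect questions whose VQA answer matches the expected one.

    Assumption: every aspect counts equally and matching is exact after
    normalization. The paper may weight or match differently.
    """
    if len(questions) != len(vqa_answers):
        raise ValueError("need exactly one VQA answer per question")
    if not questions:
        return 0.0
    matched = sum(
        normalize(answer) == normalize(q.expected)
        for q, answer in zip(questions, vqa_answers)
    )
    return matched / len(questions)


# Usage with made-up questions and answers for a single generated image.
example_questions = [
    AspectQuestion("object", "Is there a dog in the image?", "yes"),
    AspectQuestion("attribute", "What color is the dog?", "brown"),
    AspectQuestion("relation", "Is the dog on a sofa?", "yes"),
    AspectQuestion("counting", "How many dogs are there?", "two"),
]
example_answers = ["Yes.", "brown", "no", "Two"]
example_score = aspect_matching_score(example_questions, example_answers)
```

A per-aspect breakdown, rather than a single image-level similarity, is what makes this kind of score interpretable: a failed question points to the specific aspect of the prompt that the image missed.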

Paper Structure

This paper contains 33 sections, 1 equation, 16 figures, 4 tables.

Figures (16)

  • Figure 1: Overview of MMIG-Bench. We present a unified multi-modal benchmark that contains 1,750 multi-view reference images with 4,850 richly annotated text prompts, covering both text-only and image-text-conditioned generation. We also propose a comprehensive three-level evaluation framework that delivers holistic and interpretable scores: low-level metrics for artifacts and identity preservation, a mid-level VQA-based Aspect Matching Score, and high-level metrics for aesthetics and human preference.
  • Figure 2: Statistics of the tags in MMIG-Bench. Top-left: Data distribution of compositional categories and high-level categories for text prompts in the T2I task. Bottom-left: Data distribution of text prompts in the customization task. Right: Statistics of classes for the reference images.
  • Figure 3: Our data curation pipeline for multi-modal image generation benchmarking. We begin by extracting 207 frequent entities from public T2I datasets. Using these entities, we generate diverse prompts with GPT-4o by prompting it with a set of carefully designed instruction templates, which control the structure and style of the prompts (left). Simultaneously, we collect grouped reference images for each entity from free stock sources, with human annotators selecting 3–5 object-centric images per group that vary in pose or view (right). We further collect artistic images in 12 visual styles to support style transfer. The resulting dataset includes high-quality, structured text-image pairs for both T2I and customization.
  • Figure 4: A qualitative study of text-only (top) and text-image-conditioned (bottom) generation methods on MMIG-Bench.
  • Figure 5: Word clouds of text prompts for the text-only generation (T2I) task (left) and the multimodal generation task (right).
  • ...and 11 more figures