MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation

Jihan Yao, Yushi Hu, Yujie Yi, Bin Han, Shangbin Feng, Guang Yang, Bingbing Wen, Ranjay Krishna, Lucy Lu Wang, Yulia Tsvetkov, Noah A. Smith, Banghua Zhu

TL;DR

MMMG is a comprehensive, human-aligned benchmark for multitask multimodal generation across image, audio, and interleaved modalities, featuring 49 tasks and 937 instructions designed to yield reliable automatic evaluation. By combining verifiable programmatic checks, carefully crafted VLM-based assessments, and audio/text metrics, MMMG achieves high alignment with human judgments and enables fine-grained analysis of model weaknesses. The study shows that autoregressive models (ARMs) outperform diffusion models on image tasks, but it also highlights substantial headroom in multimodal reasoning and audio generation. Benchmark scores correlate strongly with real-world preferences. MMMG serves as both a leaderboard and a scalable validation signal for future multimodal model development, while acknowledging limitations related to its dependence on proprietary evaluators and its task coverage. Overall, MMMG advances the evaluation of complex multimodal generation and informs targeted research directions for more capable and reliable systems.

Abstract

Automatically evaluating multimodal generation presents a significant challenge, as automated metrics often struggle to align reliably with human evaluation, especially for complex tasks that involve multiple modalities. To address this, we present MMMG, a comprehensive and human-aligned benchmark for multimodal generation across 4 modality combinations (image, audio, interleaved text and image, interleaved text and audio), with a focus on tasks that present significant challenges for generation models, while still enabling reliable automatic evaluation through a combination of models and programs. MMMG encompasses 49 tasks (including 29 newly developed ones), each with a carefully designed evaluation pipeline, and 937 instructions to systematically assess reasoning, controllability, and other key capabilities of multimodal generation models. Extensive validation demonstrates that MMMG is highly aligned with human evaluation, achieving an average agreement of 94.3%. Benchmarking results on 24 multimodal generation models reveal that even though the state-of-the-art model, GPT Image, achieves 78.3% accuracy for image generation, it falls short on multimodal reasoning and interleaved generation. Furthermore, results suggest considerable headroom for improvement in audio generation, highlighting an important direction for future research.
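To make the "average agreement" figure concrete, the following is a minimal sketch of how agreement between automatic and human judgments could be computed. It assumes binary pass/fail judgments per instance and an unweighted mean over tasks; the exact protocol used by the authors is not stated in this summary, and the function names `agreement_rate` and `average_agreement` are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def agreement_rate(auto: Sequence[bool], human: Sequence[bool]) -> float:
    """Fraction of instances where the automatic judgment matches the human one."""
    if len(auto) != len(human):
        raise ValueError("judgment lists must have the same length")
    if len(auto) == 0:
        raise ValueError("at least one judgment is required")
    matches = sum(a == h for a, h in zip(auto, human))
    return matches / len(auto)


def average_agreement(
    per_task: Dict[str, Tuple[Sequence[bool], Sequence[bool]]]
) -> float:
    """Unweighted mean of per-task agreement rates (an assumed aggregation)."""
    if not per_task:
        raise ValueError("at least one task is required")
    rates = [agreement_rate(auto, human) for auto, human in per_task.values()]
    return sum(rates) / len(rates)


# Toy data, made up for illustration only (not results from the paper).
toy = {
    "task_a": ([True, False, True, True], [True, False, False, True]),
    "task_b": ([True, True], [True, True]),
}
print(average_agreement(toy))
```

Averaging per task rather than per instance keeps tasks with many instructions from dominating the score; whether the benchmark aggregates this way is an assumption of the sketch.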

Paper Structure

This paper contains 40 sections, 58 figures, 14 tables.

Figures (58)

  • Figure 1: Examples of tasks and their evaluation metrics in MMMG. For each task, we develop an evaluation metric using programs, models, or a combination of the two. The tasks are either verifiable purely by programs or have large generation-evaluation gaps: generation is challenging for models, while automatic evaluations correlate highly with human judgments. We show evaluation pseudo-code to demonstrate the evaluation process.
  • Figure 2: Benchmark results of multimodal generation models on MMMG covering four modality combinations. Please refer to Table \ref{tab:detail} for more detailed category information. We aggregate some sub-tasks for interleaved image-text generation. GPT Image beats all other models on most image generation tasks, and competes strongly with other baselines in generating consistent image sequences and coherent interleaved image-text content.
  • Figure 3: Two prevalent failure cases observed in interleaved image-text generation tasks for Gemini Image: (1) models fail to accurately interpret the order of images in interleaved inputs; and (2) models frequently blend multiple images together, possibly due to limitations in encoding multiple images with continuous latent image representations.
  • Figure 4: Human annotation interface for the instrument inclusion task. Typically, the interface includes reference audio/images, the model's generation, the evaluation instruction, the evaluation criteria, judgment radio buttons, and next/previous buttons.
  • Figure 5: Examples for the task: Object Inclusion
  • ...and 53 more figures