MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation
Jihan Yao, Yushi Hu, Yujie Yi, Bin Han, Shangbin Feng, Guang Yang, Bingbing Wen, Ranjay Krishna, Lucy Lu Wang, Yulia Tsvetkov, Noah A. Smith, Banghua Zhu
TL;DR
MMMG is a comprehensive, human-aligned benchmark for multitask multimodal generation across image, audio, and interleaved modalities, with 49 tasks and 937 instructions designed to support reliable automatic evaluation. By combining verifiable programmatic checks, carefully crafted VLM-based assessments, and audio/text metrics, MMMG achieves high alignment with human judgments and enables fine-grained analysis of model weaknesses. The study shows that autoregressive models (ARMs) outperform diffusion models on image tasks, but it also finds substantial headroom in multimodal reasoning and audio generation; its results correlate strongly with real-world preferences. MMMG serves as both a leaderboard and a scalable validation signal for future multimodal model development, though the authors acknowledge limitations related to dependence on proprietary evaluators and to task coverage. Overall, MMMG advances the evaluation of complex multimodal generation and informs targeted research directions for more capable and reliable systems.
Abstract
Automatically evaluating multimodal generation presents a significant challenge, as automated metrics often struggle to align reliably with human evaluation, especially for complex tasks that involve multiple modalities. To address this, we present MMMG, a comprehensive and human-aligned benchmark for multimodal generation across 4 modality combinations (image, audio, interleaved text and image, interleaved text and audio), with a focus on tasks that present significant challenges for generation models, while still enabling reliable automatic evaluation through a combination of models and programs. MMMG encompasses 49 tasks (including 29 newly developed ones), each with a carefully designed evaluation pipeline, and 937 instructions to systematically assess reasoning, controllability, and other key capabilities of multimodal generation models. Extensive validation demonstrates that MMMG is highly aligned with human evaluation, achieving an average agreement of 94.3%. Benchmarking results on 24 multimodal generation models reveal that even though the state-of-the-art model, GPT Image, achieves 78.3% accuracy for image generation, it falls short on multimodal reasoning and interleaved generation. Furthermore, results suggest considerable headroom for improvement in audio generation, highlighting an important direction for future research.
