MMGenBench: Fully Automatically Evaluating LMMs from the Text-to-Image Generation Perspective
Hailang Huang, Yong Wang, Zixuan Huang, Huaqiu Li, Tongwen Huang, Xiangxiang Chu, Richong Zhang
TL;DR
MMGenBench introduces a fully automated pipeline for evaluating how well LMMs understand and describe images: the LMM converts each input image into a textual prompt, a text-to-image model generates an auxiliary image from that prompt, and the original and generated images are compared through their embeddings. The benchmark suite comprises MMGenBench-Test (1,284 images across 13 image patterns) and MMGenBench-Domain (10,000 domain images), enabling broad, domain-agnostic assessment of more than 50 LMMs. Key findings are that models scoring well on existing benchmarks often underperform on basic image understanding and description tasks, and that model size alone is not a reliable predictor of capability. SIM-Score is proposed as the primary metric, with FID-Score as a secondary measure of generative alignment. The framework offers a scalable, low-human-effort approach to ongoing benchmarking as LMMs and domains evolve, and it highlights concrete areas for improvement in instruction-following, descriptive depth, and resistance to overfitting.
Abstract
Large Multimodal Models (LMMs) demonstrate impressive capabilities. However, current benchmarks predominantly focus on image comprehension in specific domains and are labor-intensive to construct. Moreover, their answers tend to be brief, making it difficult to assess the ability of LMMs to generate detailed descriptions of images. To address these limitations, we propose the MMGenBench-Pipeline, a straightforward and fully automated evaluation pipeline. The pipeline generates a textual description of each input image, uses that description to create an auxiliary image with a text-to-image generative model, and then compares the original and generated images. Furthermore, to ensure the effectiveness of MMGenBench-Pipeline, we design MMGenBench-Test, which evaluates LMMs across 13 distinct image patterns, and MMGenBench-Domain, which focuses on generative image performance. A thorough evaluation involving over 50 popular LMMs demonstrates the effectiveness and reliability of both the pipeline and the benchmark. Our observations indicate that numerous LMMs that excel on existing benchmarks fail to adequately complete basic tasks related to image understanding and description. This finding highlights the substantial room for performance improvement in current LMMs and suggests avenues for future model optimization. In addition, MMGenBench-Pipeline can efficiently assess the performance of LMMs across diverse domains using only image inputs.
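
The three stages described above (image to text, text to image, image comparison) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The callables `describe`, `generate`, and `embed` are hypothetical placeholders for an LMM, a text-to-image model, and an image encoder. The sketch also assumes the similarity score is the mean cosine similarity between embeddings of the original and generated images, which may differ from the exact definition of SIM-Score used in the paper.

```python
import math
from typing import Any, Callable, Iterable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def evaluate_lmm(
    images: Iterable[Any],
    describe: Callable[[Any], str],   # hypothetical: LMM, image -> prompt
    generate: Callable[[str], Any],   # hypothetical: text-to-image model
    embed: Callable[[Any], Vector],   # hypothetical: image encoder
) -> float:
    """Mean embedding similarity between original and regenerated images.

    Assumption: this mean cosine similarity stands in for the paper's
    SIM-Score; the paper's exact formula is not reproduced here.
    """
    scores = []
    for image in images:
        prompt = describe(image)        # stage 1: image -> text
        regenerated = generate(prompt)  # stage 2: text -> auxiliary image
        # stage 3: compare original and auxiliary image in embedding space
        scores.append(cosine_similarity(embed(image), embed(regenerated)))
    if not scores:
        raise ValueError("no images were provided")
    return sum(scores) / len(scores)
```

Because the loop needs only images as input, the same function could in principle be pointed at any image collection, which is the property the abstract emphasizes. A distribution-level measure such as FID would be computed separately over the full sets of original and generated images rather than per pair.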
