MMGenBench: Fully Automatically Evaluating LMMs from the Text-to-Image Generation Perspective

Hailang Huang, Yong Wang, Zixuan Huang, Huaqiu Li, Tongwen Huang, Xiangxiang Chu, Richong Zhang

TL;DR

MMGenBench introduces a fully automated pipeline to evaluate LMMs on image understanding and detailed image description by converting images into textual prompts, generating auxiliary images via text-to-image models, and comparing embeddings with representational metrics. The benchmark suite comprises MMGenBench-Test (13 image patterns, 1,284 images) and MMGenBench-Domain (10,000 domain images), enabling broad, domain-agnostic assessment across 50+ LMMs. Key findings show that strong models on existing benchmarks often underperform on basic understanding tasks, and model size alone is not a reliable predictor of capability; SIM-Score is proposed as the primary metric, with FID-Scores providing a secondary measure of generative alignment. The framework offers a scalable, low-human-effort approach for ongoing benchmarking across evolving LMMs and domains, highlighting concrete areas for improvement in instruction-following, descriptive depth, and resistance to overfitting.

Abstract

Large Multimodal Models (LMMs) demonstrate impressive capabilities. However, current benchmarks predominantly focus on image comprehension in specific domains, and these benchmarks are labor-intensive to construct. Moreover, their answers tend to be brief, making it difficult to assess the ability of LMMs to generate detailed descriptions of images. To address these limitations, we propose the MMGenBench-Pipeline, a straightforward and fully automated evaluation pipeline. This involves generating textual descriptions from input images, using these descriptions to create auxiliary images via text-to-image generative models, and then comparing the original and generated images. Furthermore, to ensure the effectiveness of MMGenBench-Pipeline, we design MMGenBench-Test, evaluating LMMs across 13 distinct image patterns, and MMGenBench-Domain, focusing on generative image performance. A thorough evaluation involving over 50 popular LMMs demonstrates the effectiveness and reliability of both the pipeline and benchmark. Our observations indicate that numerous LMMs excelling in existing benchmarks fail to adequately complete the basic tasks related to image understanding and description. This finding highlights the substantial potential for performance improvement in current LMMs and suggests avenues for future model optimization. Concurrently, MMGenBench-Pipeline can efficiently assess the performance of LMMs across diverse domains using only image inputs.
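The three pipeline stages described above (image to text, text to image, image comparison) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the callables `describe`, `generate`, and `embed` are hypothetical stand-ins for an LMM, a text-to-image model, and an image representation model, and scoring by mean cosine similarity between embeddings is an assumption about how the original and generated images might be compared.

```python
import math
from typing import Callable, Iterable, Sequence, TypeVar

Image = TypeVar("Image")
Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for degenerate embeddings
    return dot / (norm_a * norm_b)


def evaluate_lmm(
    images: Iterable[Image],
    describe: Callable[[Image], str],   # hypothetical: LMM, image -> description
    generate: Callable[[str], Image],   # hypothetical: text-to-image model
    embed: Callable[[Image], Vector],   # hypothetical: image representation model
) -> float:
    """Mean similarity between each input image and its regenerated counterpart."""
    scores = []
    for image in images:
        prompt = describe(image)        # stage 1: image -> text
        generated = generate(prompt)    # stage 2: text -> auxiliary image
        scores.append(cosine_similarity(embed(image), embed(generated)))  # stage 3
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy stand-ins: an "image" is a tuple of floats and a "description"
    # is its comma-separated text form, so a faithful describer loses nothing.
    toy_images = [(1.0, 2.0, 3.0), (0.5, 0.0, 4.0)]
    faithful = lambda img: ",".join(str(v) for v in img)
    lossy = lambda img: ",".join(str(v) for v in img[:1]) + ",0.0,0.0"
    render = lambda text: tuple(float(v) for v in text.split(","))
    identity = lambda img: img

    print(evaluate_lmm(toy_images, faithful, render, identity))
    print(evaluate_lmm(toy_images, lossy, render, identity))
```

In the toy run, the describer that drops information scores lower than the faithful one, which is the intuition behind using regenerated images as a proxy for description quality. No human-written reference answers appear anywhere in the loop, only input images.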

Paper Structure

This paper contains 26 sections, 5 equations, 21 figures, 4 tables.

Figures (21)

  • Figure 1: MMGenBench-Test consists of 13 distinct image patterns, each of which includes several images. The text accompanying each pattern is a concise explanation of that image pattern. Please refer to the appendix (MMGenBench-Test data details) for more information.
  • Figure 2: Comparison between previous benchmarks and MMGenBench. MMGenBench has several novel features: 1) Built on powerful text-to-image models and image representation models, MMGenBench evaluates LMMs fully automatically, without expensive manual annotation; 2) MMGenBench can easily evaluate LMM performance in any domain, whereas previous benchmarks could mostly evaluate performance only in specific domains; 3) The answers in previous benchmarks were mostly brief, so they overlooked the basic ability to generate detailed descriptions of images.
  • Figure 3: An overview of the MMGenBench-Pipeline, illustrating the fully automated evaluation process. The pipeline receives user input (the task instruction prompt and the input images), and the LMM generates textual descriptions of the input images. A powerful text-to-image model then generates auxiliary images from these descriptions, an image representation model produces representations of both the input and the generated images, and the pipeline finally outputs the evaluation score of the LMM.
  • Figure 4: Statistics of MMGenBench-Test, which contains 13 image patterns with 1,284 images. More details are in the section on MMGenBench benchmark construction.
  • Figure 5: An overview of the MMGenBench-Test benchmark construction process. We first use GPT-4o to extract the image patterns from the input images. Then, we use GPT-4 Turbo to summarize these patterns and manually select 13 of them. Subsequently, GPT-4o is employed again to re-annotate the images with these patterns. Human annotators then review and revise these annotations to produce the final result.
  • ...and 16 more figures