MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models

Wulin Xie, Yi-Fan Zhang, Chaoyou Fu, Yang Shi, Bingyan Nie, Hongkai Chen, Zhang Zhang, Liang Wang, Tieniu Tan

TL;DR

MME-Unify addresses the lack of standardized benchmarks for Unified Multimodal LLMs by integrating traditional understanding and generation tasks with five novel unified tasks that require mixed-modality outputs. It introduces a three-domain evaluation and standardizes metrics to enable fair cross-model comparisons across 12 datasets, 10 tasks, and 30 subtasks. The empirical study of 22 models reveals persistent gaps between open-source U-MLLMs and specialized systems, with substantial challenges in balancing understanding, generation, and unified reasoning, especially for multi-step tasks like Visual CoT. This benchmark offers a practical framework and diagnostic insights to guide future research toward more cohesive cross-modal competence.

Abstract

Existing MLLM benchmarks face significant challenges in evaluating Unified MLLMs (U-MLLMs) due to: 1) lack of standardized benchmarks for traditional tasks, leading to inconsistent comparisons; 2) absence of benchmarks for mixed-modality generation, which leaves multimodal reasoning capabilities unassessed. We present a comprehensive evaluation framework designed to systematically assess U-MLLMs. Our benchmark includes: 1. Standardized Traditional Task Evaluation. We sample from 12 datasets, covering 10 tasks with 30 subtasks, ensuring consistent and fair comparisons across studies. 2. Unified Task Assessment. We introduce five novel tasks testing multimodal reasoning, including image editing, commonsense QA with image generation, and geometric reasoning. 3. Comprehensive Model Benchmarking. We evaluate 12 leading U-MLLMs, such as Janus-Pro, EMU3, VILA-U, and Gemini2-flash, alongside specialized understanding (e.g., Claude-3.5-Sonnet) and generation models (e.g., DALL-E-3). Our findings reveal substantial performance gaps in existing U-MLLMs, highlighting the need for more robust models capable of handling mixed-modality tasks effectively. The code and evaluation data can be found at https://mme-unify.github.io/.

Paper Structure

This paper contains 20 sections, 30 equations, 25 figures, 5 tables.

Figures (25)

  • Figure 1: A comprehensive visualization of the diverse tasks in MME-U and the leaderboard. Panel (a) illustrates the wide-ranging nature of the tasks covered in our benchmark, which span from traditional understanding tasks to complex mixed-modality generation challenges. Panel (b), the leaderboard, highlights the performance rankings of various U-MLLMs in our benchmark.
  • Figure 2: Comparison of complex instruction-based image generation results from open-source U-MLLMs (DeepSeek-Janus Flow, EMU3), closed-source U-MLLMs (GPT-4o, Gemini-2), and proprietary generation models (DALL-E-3). The closed-source U-MLLMs surpass the proprietary generation models, and their lead over the open-source models is significantly larger.
  • Figure 3: Diagram of our MME-Unify. Our benchmark consists of 3 main domains, encompassing 15 subtasks to comprehensively evaluate U-MLLMs' understanding, generation, and unified capabilities. Specifically, each unify task includes at least one question, an input image, multiple text choices, and image choices. The image choices consist of a correct answer image and a set of manually crafted negative samples. During the evaluation process, we input the image, question, and text options, and the U-MLLMs are required to select the correct text answer and generate an image. The text answer is evaluated by matching it with the correct answer, while the generated image is compared with the constructed image choices. If the CLIP score between the generated image and the correct answer image is the highest, it is considered correct; otherwise, it is deemed incorrect.
  • Figure 4: Accuracy distribution across different dimensions on the Visual CoT task: (a) action, (b) location, and (c) image.
  • Figure 5: The generated results from various models in the text-to-image generation task, based on the following text prompt: A man is standing in a park with a 'Run for Rights' banner in the background. He is wearing a white t-shirt with the number 28 on it, grey shorts, and grey socks with black shoes. The park is filled with people, some sitting on benches, and there is a bicycle leaning against a tree.
  • ...and 20 more figures
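
The scoring rule described in the Figure 3 caption (match the text answer, then check whether the generated image is closest to the correct image choice) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the images have already been embedded by a CLIP image encoder, it approximates the CLIP score by cosine similarity between embeddings, and it treats ties as incorrect. The function names `text_is_correct`, `image_is_correct`, and `score_unified_sample` are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def text_is_correct(predicted: str, answer: str) -> bool:
    """Match the selected text option against the ground-truth option label.

    Assumption: answers are option labels such as "A" or "B", compared
    case-insensitively after stripping whitespace.
    """
    return predicted.strip().upper() == answer.strip().upper()


def image_is_correct(
    generated: Sequence[float],
    choices: Sequence[Sequence[float]],
    correct_index: int,
) -> bool:
    """True if the correct choice scores strictly higher than every negative sample.

    `generated` and `choices` are assumed to be precomputed image embeddings.
    """
    scores = [cosine_similarity(generated, choice) for choice in choices]
    correct_score = scores[correct_index]
    return all(
        correct_score > score
        for index, score in enumerate(scores)
        if index != correct_index
    )


def score_unified_sample(
    predicted_text: str,
    answer_text: str,
    generated: Sequence[float],
    choices: Sequence[Sequence[float]],
    correct_index: int,
) -> Dict[str, bool]:
    """Report text and image correctness separately for one sample."""
    return {
        "text_correct": text_is_correct(predicted_text, answer_text),
        "image_correct": image_is_correct(generated, choices, correct_index),
    }


if __name__ == "__main__":
    # Toy 2-D vectors standing in for real CLIP embeddings.
    generated = [1.0, 0.0]
    choices = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
    print(score_unified_sample(" a ", "A", generated, choices, correct_index=0))
```

The two outcomes are reported separately because the caption does not say how text and image correctness are combined into a single score.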