MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models

Wulin Xie; Yi-Fan Zhang; Chaoyou Fu; Yang Shi; Bingyan Nie; Hongkai Chen; Zhang Zhang; Liang Wang; Tieniu Tan

TL;DR

MME-Unify addresses the lack of standardized benchmarks for Unified Multimodal LLMs by integrating traditional understanding and generation tasks with five novel unified tasks that require mixed-modality outputs. It introduces a three-domain evaluation and standardizes metrics to enable fair cross-model comparisons across 12 datasets, 10 tasks, and 30 subtasks. The empirical study of 22 models reveals persistent gaps between open-source U-MLLMs and specialized systems, with substantial challenges in balancing understanding, generation, and unified reasoning, especially for multi-step tasks like Visual CoT. This benchmark offers a practical framework and diagnostic insights to guide future research toward more cohesive cross-modal competence.

Abstract

Existing MLLM benchmarks face significant challenges in evaluating Unified MLLMs (U-MLLMs) due to: 1) the lack of standardized benchmarks for traditional tasks, which leads to inconsistent comparisons; 2) the absence of benchmarks for mixed-modality generation, which leaves multimodal reasoning capabilities unassessed. We present a comprehensive evaluation framework designed to systematically assess U-MLLMs. Our benchmark includes: 1. Standardized Traditional Task Evaluation. We sample from 12 datasets, covering 10 tasks with 30 subtasks, ensuring consistent and fair comparisons across studies. 2. Unified Task Assessment. We introduce five novel tasks testing multimodal reasoning, including image editing, commonsense QA with image generation, and geometric reasoning. 3. Comprehensive Model Benchmarking. We evaluate 12 leading U-MLLMs, such as Janus-Pro, EMU3, VILA-U, and Gemini2-flash, alongside specialized understanding (e.g., Claude-3.5-Sonnet) and generation models (e.g., DALL-E-3). Our findings reveal substantial performance gaps in existing U-MLLMs, highlighting the need for more robust models capable of handling mixed-modality tasks effectively. The code and evaluation data can be found at https://mme-unify.github.io/.
