A Survey on Multimodal Benchmarks: In the Era of Large AI Models

Lin Li, Guikun Chen, Hanrong Shi, Jun Xiao, Long Chen

TL;DR

The paper surveys 211 multimodal benchmarks for Multimodal Large Language Models (MLLMs), organizing them into four domains: understanding, reasoning, generation, and application. It analyzes task designs and evaluation metrics, identifies core gaps such as fragmented objectives and evolving metrics, and proposes a cohesive benchmarking framework alongside future directions. The work aims to guide researchers and practitioners in selecting and designing benchmarks that faithfully reflect MLLM capabilities and reliability. It also highlights the data-collection and quality-control practices that are critical for robust, scalable evaluation in real-world scenarios.

Abstract

The rapid evolution of Multimodal Large Language Models (MLLMs) has brought substantial advancements in artificial intelligence, significantly enhancing the capability to understand and generate multimodal content. While prior studies have largely concentrated on model architectures and training methodologies, the benchmarks used to evaluate these models remain underexplored. This survey addresses this gap by systematically reviewing 211 benchmarks that assess MLLMs across four core domains: understanding, reasoning, generation, and application. We provide a detailed analysis of task designs, evaluation metrics, and dataset construction across diverse modalities. We hope that this survey will contribute to the ongoing advancement of MLLM research by offering a comprehensive overview of benchmarking practices and identifying promising directions for future work. An associated GitHub repository collecting the latest papers is available.

Paper Structure (29 sections, 2 figures, 4 tables)