Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models

Yuansen Liu, Haiming Tang, Jinlong Peng, Jiangning Zhang, Xiaozhong Ji, Qingdong He, Wenbin Wu, Donghao Luo, Zhenye Gan, Junwei Zhu, Yunhang Shen, Chaoyou Fu, Chengjie Wang, Xiaobin Hu, Shuicheng Yan

TL;DR

Human-MME is a holistic benchmark for evaluating human-centric multimodal LLMs, spanning 43 fine-grained visual scenarios and eight evaluation dimensions that progress from granular perception to high-level reasoning across 19,945 QA pairs. The framework combines automated annotation with expert manual curation and covers multi-image and multi-person mutual understanding through diverse question types and grounding tasks. Benchmarking on 17 MLLMs shows that grounding-trained models excel at localization, that model scale improves performance on choice and ranking tasks, and that high-level reasoning remains challenging, with clear distinctions across architectures and training data. The benchmark and its accompanying annotation pipeline, data, and code are intended to drive future advances toward more robust, human-centric image understanding and reasoning in MLLMs.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks. However, their capacity to comprehend human-centric scenes has rarely been explored, primarily because no comprehensive evaluation benchmark accounts for both human-oriented granular perception and higher-dimensional causal reasoning. Building such a high-quality benchmark is difficult, given the physical complexity of the human body and the difficulty of annotating granular structures. In this paper, we propose Human-MME, a curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric scene understanding. Compared with existing benchmarks, our work provides three key features: 1. Diversity in human scenes, spanning 4 primary visual domains with 15 secondary domains and 43 sub-fields to ensure broad scenario coverage. 2. Progressive and diverse evaluation dimensions, evaluating human-based activities progressively from human-oriented granular perception to higher-dimensional reasoning, consisting of eight dimensions with 19,945 real-world image-question pairs and an evaluation suite. 3. High-quality annotations with rich data paradigms, built with an automated annotation pipeline and a human-annotation platform that supports rigorous manual labeling to facilitate precise and reliable model assessment. Our benchmark extends single-target understanding to multi-person and multi-image mutual understanding by constructing choice, short-answer, grounding, ranking, and judgment question components, as well as complex questions that combine them. Extensive experiments on 17 state-of-the-art MLLMs expose their limitations and guide future MLLM research toward better human-centric image understanding. All data and code are available at https://github.com/Yuan-Hou/Human-MME.
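To make the idea of a complex question built from several components more concrete, the following is a minimal Python sketch. It is not the benchmark's evaluation suite: the `Component` class, the `(x1, y1, x2, y2)` box format, exact-match scoring for choice, judgment, and ranking components, and intersection-over-union (IoU) for grounding components are all assumptions made for illustration, and short-answer components are left out because they would need a semantic matcher.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Component:
    """One part of a (possibly combined) question. Hypothetical structure."""
    kind: str  # "choice", "judgment", "ranking", or "grounding"
    answer: Union[str, List[str], Box]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def score_component(gold: Component, pred) -> float:
    """Assumed scoring: IoU for grounding, exact match for everything else."""
    if gold.kind == "grounding":
        return box_iou(gold.answer, pred)
    return 1.0 if pred == gold.answer else 0.0


def score_question(gold: List[Component], preds: list) -> float:
    """Average the component scores of one combined question."""
    scores = [score_component(g, p) for g, p in zip(gold, preds)]
    return sum(scores) / len(scores)


# A combined question: pick an option, then localize the chosen person.
question = [
    Component("choice", "B"),
    Component("grounding", (10.0, 10.0, 50.0, 90.0)),
]
full_credit = score_question(question, ["B", (10.0, 10.0, 50.0, 90.0)])
half_credit = score_question(question, ["A", (10.0, 10.0, 50.0, 90.0)])
```

Averaging component scores is only one possible design; a stricter scheme could require every component of a combined question to be correct before awarding any credit.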

Paper Structure

This paper contains 37 sections, 7 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Overview of Human-MME: The progressive and diverse evaluation dimensions can be divided into eight aspects from the human-oriented granular dimension perception (e.g., face, body, human-object interaction understanding) to higher-dimension reasoning (e.g., multi-image and multi-person understanding, intention, emotion, cause discrimination).
  • Figure 2: Curation of Human-MME consists of: (1) Data collection to provide images for annotation and QA generation (Section \ref{subsec:data_source}); (2) Automated annotation to provide the original feature set for each person $i$ in the image (Section \ref{subsec:auto_anno}); (3) Manual adjustment to ensure annotation quality for final question-answer construction (Section \ref{subsec:manual}); (4) Construction of question-answer pairs using the features extracted from the image (Section \ref{subsec:qad}, not shown in the figure).
  • Figure 3: Human-MME demonstrates rich diversity: Images: (a) shows a sunburst chart of image content across four domains and subcategories; (b) indicates image resolutions ranging from below 480p to over 4K; (c) presents the number of people per image, from single- to multi-person scenes; (d) illustrates a word cloud of HOI objects, covering interactions with hundreds of distinct objects. QA Pairs: (e) illustrates multiple question types distributed across eight reasoning dimensions; (f) shows the lengths of questions, capturing variability in question complexity.
  • Figure 4: Correlation between model size and performance in different question components. To minimize the influence of differences in model architecture and training strategies, only the models ranking in the top half of the overall performance table (Table \ref{tbl:dim_score}) are selected for this analysis. These models all belong to the GLM, Qwen, and Intern families.
  • Figure 5: Confusion matrices for body and face grounding tasks. This figure presents the confusion matrices of three MLLMs on the Body Grounding and Face Grounding tasks. All images containing any overlap or ambiguity between left and right hands or feet are removed in advance to ensure that the evaluation focuses purely on the models’ ability to distinguish left-right body and facial parts.
  • ...and 2 more figures