HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding

Yuxuan Cai, Jiangning Zhang, Zhenye Gan, Qingdong He, Xiaobin Hu, Junwei Zhu, Yabiao Wang, Chengjie Wang, Zhucun Xue, Chaoyou Fu, Xinwei He, Xiang Bai

TL;DR

HumanVideo-MME introduces a holistic evaluation suite for multimodal large language models on human-centric video tasks, addressing gaps in prior benchmarks. It spans 13 cognitive tasks across 50 domains and supports MC, FIB, TF, and OEQ formats over videos from 10 seconds to 30 minutes, enabling robust spatiotemporal reasoning. The authors implement a two-stage automated QA annotation pipeline and a composite causal reasoning metric that combines lexical, structural, and semantic coherence scores, formalized as $Score = \alpha \cdot Score_F + \beta \cdot Score_O + \gamma \cdot Score^{norm}_G$ with defaults $\alpha=0.5$, $\beta=0.3$, $\gamma=0.5$. Benchmark results show that while open-source MLLMs excel in MC/TF formats on high-level reasoning, they struggle with generation-based FIB/OEQ tasks and fine-grained perception, revealing reliance on priors rather than genuine reasoning. These findings guide future development toward deeper, more grounded human-centric reasoning in MLLMs.
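The weighted sum quoted above can be sketched in a few lines. This is only an illustration of the arithmetic, not the authors' implementation: the function name `composite_score` and its argument names are hypothetical, the three component scores are treated as opaque inputs computed elsewhere, and the assumption that each one lies in $[0, 1]$ is ours rather than something the summary states.

```python
def composite_score(score_f, score_o, score_g_norm,
                    alpha=0.5, beta=0.3, gamma=0.5):
    """Weighted sum Score = alpha*Score_F + beta*Score_O + gamma*Score_G^norm.

    The default weights are the values quoted in the summary. How each
    component score is computed is not covered here; they are assumed to
    be numbers that were already calculated (hypothetically in [0, 1]).
    """
    return alpha * score_f + beta * score_o + gamma * score_g_norm


if __name__ == "__main__":
    # Made-up component scores, used purely to show the arithmetic.
    print(composite_score(0.8, 0.5, 0.6))
```

With the defaults as quoted, the weights add up to 1.3, so under the $[0, 1]$ assumption the composite score ranges over $[0, 1.3]$ rather than $[0, 1]$. Different weights can be passed explicitly, for example `composite_score(0.8, 0.5, 0.6, alpha=0.2)`.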

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. However, their capacity to comprehend human-centric video data remains underexplored, primarily due to the absence of comprehensive, high-quality evaluation benchmarks. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. Furthermore, they are often limited by single-question paradigms and overly simplistic evaluation metrics. To address the above limitations, we propose HV-MMBench, a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding. Compared to existing human-centric video benchmarks, our work offers the following key features: (1) Diverse evaluation dimensions: HV-MMBench encompasses 13 tasks, ranging from basic attribute perception (e.g., age estimation, emotion recognition) to advanced cognitive reasoning (e.g., social relationship prediction, intention prediction), enabling comprehensive assessment of model capabilities; (2) Varied data types: The benchmark includes multiple-choice, fill-in-blank, true/false, and open-ended question formats, combined with diverse evaluation metrics, to reflect model performance more accurately and robustly; (3) Multi-domain video coverage: The benchmark spans 50 distinct visual scenarios, enabling comprehensive evaluation across fine-grained scene variations; (4) Temporal coverage: The benchmark covers videos from short-term (10 seconds) to long-term (up to 30 minutes) durations, supporting systematic analysis of models' temporal reasoning abilities across diverse contextual lengths.

Paper Structure

This paper contains 20 sections, 1 equation, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overview of HumanVideo-MME, which spans diverse human-centric scenarios (50+ domains, with videos from 10 seconds to 30 minutes) and covers both basic perception and advanced reasoning tasks. It supports Multiple-Choice (MC), Fill-in-Blank (FIB), True/False (TF), and Open-Ended Questions (OEQ) to comprehensively evaluate MLLMs’ understanding and cognitive capabilities.
  • Figure 2: HumanVideo-MME construction pipeline. The benchmark is built through a three-stage pipeline: (1) large-scale Video Collection across diverse human-centric domains; (2) Automated QA annotation via MLLMs and structured templates; (3) a two-tier Quality Review combining automatic filtering and expert verification to ensure annotation reliability.
  • Figure 3: Statistics of HumanVideo-MME. (a) Our benchmark covers 50+ human-centric categories across diverse domains. (b) Video durations range from short clips to long-form content. (c) Most videos are in high resolution (720P or above), supporting fine-grained visual analysis. (d) The caption vocabulary covers diverse semantic cues, emphasizing human appearance, actions, and interactions. (e) Our evaluation tasks span both perceptual and advanced reasoning abilities. (f) Statistics of the QA pairs.