HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding
Yuxuan Cai, Jiangning Zhang, Zhenye Gan, Qingdong He, Xiaobin Hu, Junwei Zhu, Yabiao Wang, Chengjie Wang, Zhucun Xue, Chaoyou Fu, Xinwei He, Xiang Bai
TL;DR
HumanVideo-MME introduces a holistic evaluation suite for multimodal large language models (MLLMs) on human-centric video tasks, addressing gaps in prior benchmarks. It spans 13 cognitive tasks across 50 domains and supports multiple-choice (MC), fill-in-the-blank (FIB), true/false (TF), and open-ended question (OEQ) formats over videos from 10 seconds to 30 minutes, enabling evaluation of spatiotemporal reasoning across short and long contexts. The authors implement a two-stage automated QA annotation pipeline and a composite causal reasoning metric that combines lexical, structural, and semantic coherence scores, formalized as $Score = \alpha \cdot Score_F + \beta \cdot Score_O + \gamma \cdot Score^{norm}_G$ with defaults $\alpha=0.5$, $\beta=0.3$, $\gamma=0.5$. Benchmark results show that while open-source MLLMs perform well on high-level reasoning in the MC and TF formats, they struggle with the generation-based FIB and OEQ tasks and with fine-grained perception, which suggests reliance on priors rather than genuine reasoning. These findings point future development toward deeper, more grounded human-centric reasoning in MLLMs.
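The composite metric above is a plain weighted sum, which can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the names `Weights` and `composite_score` are hypothetical, the assumption that each component score lies in $[0, 1]$ is not stated in this summary, and how $Score_F$, $Score_O$, and $Score^{norm}_G$ are themselves computed is out of scope here. Only the formula and the default weights come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    """Default weights as quoted in the TL;DR (alpha, beta, gamma)."""
    alpha: float = 0.5
    beta: float = 0.3
    gamma: float = 0.5


def composite_score(score_f: float, score_o: float, score_g_norm: float,
                    w: Weights = Weights()) -> float:
    """Score = alpha * Score_F + beta * Score_O + gamma * Score_G^norm.

    Assumption (not from the source): each component is already
    normalized to the range [0, 1]; anything else is rejected.
    """
    components = {
        "score_f": score_f,
        "score_o": score_o,
        "score_g_norm": score_g_norm,
    }
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return w.alpha * score_f + w.beta * score_o + w.gamma * score_g_norm


if __name__ == "__main__":
    # Hypothetical component scores for a single model answer.
    print(composite_score(0.8, 0.5, 0.6))
```

Note that the quoted defaults sum to 1.3 rather than 1, so under the $[0, 1]$ assumption the composite can range up to 1.3. If a score bounded by 1 is wanted, dividing by $\alpha + \beta + \gamma$ would be one option, though this summary does not say whether the authors do so.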
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. However, their capacity to comprehend human-centric video data remains underexplored, primarily due to the absence of comprehensive and high-quality evaluation benchmarks. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. Furthermore, they are often limited by single-question paradigms and overly simplistic evaluation metrics. To address the above limitations, we propose HV-MMBench, a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding. Compared to existing human-centric video benchmarks, our work offers the following key features: (1) Diverse evaluation dimensions: HV-MMBench encompasses 13 tasks, ranging from basic attribute perception (e.g., age estimation, emotion recognition) to advanced cognitive reasoning (e.g., social relationship prediction, intention prediction), enabling comprehensive assessment of model capabilities; (2) Varied data types: The benchmark includes multiple-choice, fill-in-the-blank, true/false, and open-ended question formats, combined with diverse evaluation metrics, to more accurately and robustly reflect model performance; (3) Multi-domain video coverage: The benchmark spans 50 distinct visual scenarios, enabling comprehensive evaluation across fine-grained scene variations; (4) Temporal coverage: The benchmark covers videos from short-term (10 seconds) to long-term (up to 30 minutes) durations, supporting systematic analysis of models' temporal reasoning abilities across diverse contextual lengths.
