Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants

Lixiong Qin, Shilong Ou, Miaoxuan Zhang, Jiangning Wei, Yuhang Zhang, Xiaoshuai Song, Yuchen Liu, Mei Wang, Weiran Xu

TL;DR

Face-Human-Bench introduces a hierarchical ability taxonomy ($L1$–$L3$) and a semi-automatic data pipeline to benchmark face and human understanding for multi-modal assistants. It builds a development set and a test set, each with 1800 problems available in both English and Chinese and derived from public datasets, and evaluates 25 mainstream MLLMs across diverse abilities. The study reports correlations between abilities, sensitivity to the relative position of targets (measured by the relative position sensitivity score, RPSS), and model-dependent gains from Chain-of-Thought prompting, with closed-source GPT-4o outperforming many open-source baselines in some settings. The findings indicate that specialist models remain necessary for tasks such as deepfake detection, crowd counting, and robust face recognition, which points toward hybrid multi-modal systems in practice. The dataset and evaluation code are publicly available.

Abstract

Faces and humans are crucial elements in social interaction and are widely included in everyday photos and videos. Therefore, a deep understanding of faces and humans will enable multi-modal assistants to achieve improved response quality and broadened application scope. Currently, the multi-modal assistant community lacks a comprehensive and scientific evaluation of face and human understanding abilities. In this paper, we first propose a hierarchical ability taxonomy that includes three levels of abilities. Then, based on this taxonomy, we collect images and annotations from publicly available datasets in the face and human community and build a semi-automatic data pipeline to produce problems for the new benchmark. Finally, the obtained Face-Human-Bench includes a development set and a test set, each with 1800 problems, supporting both English and Chinese. We conduct evaluations over 25 mainstream multi-modal large language models (MLLMs) with our Face-Human-Bench, focusing on the correlation between abilities, the impact of the relative position of targets on performance, and the impact of Chain of Thought (CoT) prompting on performance. We also explore which abilities of MLLMs need to be supplemented by specialist models. The dataset and evaluation code have been made publicly available at https://face-human-bench.github.io.

Paper Structure (42 sections, 3 equations, 14 figures, 41 tables)

Figures (14)

  • Figure 1: The three-level ability taxonomy for evaluating face and human understanding abilities. We construct the Face-Human-Bench based on this taxonomy. The proportion of the sectors represents the weight of the corresponding abilities in the overall score on the Face-Human-Bench.
  • Figure 2: The leaderboard of MLLMs on our proposed Face-Human-Bench (English).
  • Figure 3: Correlation between abilities.
  • Figure 4: (a) The versions used for the three face understanding abilities. (b) The versions used for human attribute recognition.
  • Figure 5: The performance differences between the two versions across various models. For the three face understanding abilities, we show the performance of the original version minus that of the cropped version. For human attribute recognition, we show the performance of the box-added version minus that of the cropped version.
  • ...and 9 more figures
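
The Figure 1 caption says that each ability carries a weight in the overall score. As a rough illustration only, a weighted aggregate of per-ability accuracies could be computed as in the sketch below. The function name, the ability names, and all numeric values are hypothetical placeholders, not the paper's actual weights or results, and the official evaluation code may compute the score differently.

```python
from typing import Mapping


def weighted_overall_score(
    accuracy: Mapping[str, float], weight: Mapping[str, float]
) -> float:
    """Return the weighted mean of per-ability accuracies.

    Weights are normalized by their sum, so they do not need to add up to 1.
    This is an illustrative aggregate, not the benchmark's official scorer.
    """
    if set(accuracy) != set(weight):
        raise ValueError("accuracy and weight must cover the same abilities")
    total = sum(weight.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(accuracy[name] * weight[name] for name in accuracy) / total


# Hypothetical ability names, accuracies, and weights, for illustration only.
example_accuracy = {"ability_a": 0.70, "ability_b": 0.60}
example_weight = {"ability_a": 0.5, "ability_b": 0.5}
overall = weighted_overall_score(example_accuracy, example_weight)
```

Normalizing by the weight sum means the same function works whether the weights are given as fractions or as raw sector proportions.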