FaceXBench: Evaluating Multimodal LLMs on Face Understanding

Kartik Narayan; Vibashan VS; Vishal M. Patel

TL;DR

FaceXBench is a comprehensive, standardized benchmark for evaluating multimodal LLMs on face understanding, spanning 14 tasks across 6 categories and built from 25 public datasets plus a new FaceXAPI tool-use dataset. Its 5,000 multiple-choice questions, drawing on 10,441 unique images, enable deterministic, comparable evaluation under zero-shot, in-context, and chain-of-thought settings through a reproducible VLMEvalKit-based pipeline. Experiments show that no model exceeds 60% accuracy, that open-source MLLMs outperform proprietary ones in several face-understanding facets, and that models are notably weak at bias, deepfake detection, crowd counting, and low-resolution recognition. The authors argue for future emphasis on targeted supervised fine-tuning and enhanced tool use to advance MLLMs’ intrinsic face understanding, positioning FaceXBench as a resource for tracking progress and developing methods in this domain.

Abstract

Multimodal Large Language Models (MLLMs) demonstrate impressive problem-solving abilities across a wide range of tasks and domains. However, their capacity for face understanding has not been systematically studied. To address this gap, we introduce FaceXBench, a comprehensive benchmark designed to evaluate MLLMs on complex face understanding tasks. FaceXBench includes 5,000 multimodal multiple-choice questions derived from 25 public datasets and a newly created dataset, FaceXAPI. These questions cover 14 tasks across 6 broad categories, assessing MLLMs' face understanding abilities in bias and fairness, face authentication, recognition, analysis, localization, and tool retrieval. Using FaceXBench, we conduct an extensive evaluation of 26 open-source MLLMs alongside 2 proprietary models, revealing the unique challenges in complex face understanding tasks. We analyze the models across three evaluation settings: zero-shot, in-context task description, and chain-of-thought prompting. Our detailed analysis reveals that current MLLMs, including advanced models like GPT-4o and GeminiPro 1.5, show significant room for improvement. We believe FaceXBench will be a crucial resource for developing MLLMs equipped to perform sophisticated face understanding. Code: https://github.com/Kartik-3004/facexbench
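
Because every question is multiple-choice with a single correct option, scoring reduces to comparing an extracted option letter against the answer key. The following is a minimal sketch of per-category accuracy under stated assumptions, not the FaceXBench or VLMEvalKit implementation: the record fields (`category`, `answer`, `response`), the function names, and the answer-extraction heuristic are all hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict
from typing import Optional

# Hypothetical heuristic: accept a response only if it *starts* with an
# option letter A-D, optionally wrapped in parentheses or followed by
# punctuation/whitespace. Anything else counts as "no answer".
_CHOICE = re.compile(r"^\s*\(?([A-D])\)?(?:[\s.:,)]|$)")


def extract_choice(response: str) -> Optional[str]:
    """Return the leading option letter of a model response, or None."""
    match = _CHOICE.match(response)
    return match.group(1) if match else None


def accuracy_by_category(records):
    """Compute the fraction of correctly answered questions per category.

    Each record is a dict with assumed keys 'category', 'answer'
    (the correct letter), and 'response' (raw model output).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["category"]] += 1
        if extract_choice(rec["response"]) == rec["answer"]:
            correct[rec["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}


# Made-up records for illustration only; not drawn from the benchmark.
sample = [
    {"category": "recognition", "answer": "B", "response": "B"},
    {"category": "recognition", "answer": "C", "response": "The answer is C"},
    {"category": "localization", "answer": "A", "response": "(A) top-left"},
]
scores = accuracy_by_category(sample)
print(scores)
```

The strict extraction rule is a design choice: a response that buries the letter in free text is scored as wrong, which keeps the metric deterministic at the cost of penalizing verbose outputs. A real pipeline may use a more lenient matcher.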

Paper Structure (35 sections, 4 figures, 8 tables)

Introduction
Related Work
The FaceXBench Benchmark
Overview of FaceXBench
Data Collection
FaceXAPI
Quality Control
Experiments
Models
Evaluation Settings
Implementation Details
Results
Discussion and Future Directions
Conclusion
Motivation
...and 20 more sections

Figures (4)

Figure 2: FaceXBench examples cover a total of 14 tasks, addressing various aspects of face understanding. Each question may consist of single or multiple images. Every question includes four options, with only one correct answer. The options are strategically designed to prompt the model to analyze carefully before selecting an option.
Figure 3: Distribution of questions across different categories and tasks in FaceXBench.
Figure 4: (a) Performance of top-5 models ($4$B-$13$B parameters) across various tasks. (b) Effect of the LLM and its size on model performance. (c) Average performance of the top-5 models ($4$B-$13$B parameters) on multiple-image and single-image questions.
Figure 10.1: Collage of a subset of images from the dataset, showcasing the diversity of images used in FaceXBench.
