FaceXBench: Evaluating Multimodal LLMs on Face Understanding
Kartik Narayan, Vibashan VS, Vishal M. Patel
TL;DR
FaceXBench is a standardized benchmark for evaluating multimodal LLMs on face understanding. It spans 14 tasks across 6 categories and is built from 25 public datasets plus a new tool-use dataset, FaceXAPI. Its 5,000 multiple-choice questions, drawing on 10,441 unique images, enable deterministic, comparable evaluation under zero-shot, in-context, and chain-of-thought settings through a reproducible VLMEvalKit-based pipeline. In the reported experiments, no model exceeds 60% accuracy, open-source MLLMs outperform proprietary ones on several facets of face understanding, and models show notable weaknesses in bias, deepfake detection, crowd counting, and low-resolution recognition. The authors argue that future work should emphasize targeted supervised fine-tuning and better tool use to improve MLLMs' intrinsic face understanding, and they position FaceXBench as a resource for tracking progress and developing methods in this domain.
Abstract
Multimodal Large Language Models (MLLMs) demonstrate impressive problem-solving abilities across a wide range of tasks and domains. However, their capacity for face understanding has not been systematically studied. To address this gap, we introduce FaceXBench, a comprehensive benchmark designed to evaluate MLLMs on complex face understanding tasks. FaceXBench includes 5,000 multimodal multiple-choice questions derived from 25 public datasets and a newly created dataset, FaceXAPI. These questions cover 14 tasks across 6 broad categories, assessing MLLMs' face understanding abilities in bias and fairness, face authentication, recognition, analysis, localization, and tool retrieval. Using FaceXBench, we conduct an extensive evaluation of 26 open-source MLLMs alongside 2 proprietary models, revealing the unique challenges in complex face understanding tasks. We analyze the models across three evaluation settings: zero-shot, in-context task description, and chain-of-thought prompting. Our detailed analysis reveals that current MLLMs, including advanced models like GPT-4o and Gemini 1.5 Pro, show significant room for improvement. We believe FaceXBench will be a crucial resource for developing MLLMs equipped to perform sophisticated face understanding. Code: https://github.com/Kartik-3004/facexbench
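Because every question is multiple-choice, scoring can in principle reduce to exact matching of option letters, grouped by category. The following is a minimal sketch of that idea, not the benchmark's actual evaluation code: the record fields (`category`, `answer`, `prediction`) and the function name are hypothetical, and it assumes each model response has already been reduced to a single option letter.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Compute per-category and overall accuracy for multiple-choice results.

    Each record is a dict with hypothetical keys:
      "category":   category name, e.g. "face recognition"
      "answer":     ground-truth option letter, e.g. "B"
      "prediction": option letter already extracted from the model response
    """
    if not records:
        return {}, 0.0

    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        category = record["category"]
        totals[category] += 1
        # Compare letters case-insensitively, ignoring surrounding whitespace.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[category] += 1

    per_category = {name: correct[name] / totals[name] for name in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return per_category, overall


# Toy records for illustration only; these are not benchmark results.
example_records = [
    {"category": "face recognition", "answer": "A", "prediction": "a"},
    {"category": "face recognition", "answer": "C", "prediction": "B"},
    {"category": "face localization", "answer": "D", "prediction": " D "},
    {"category": "face localization", "answer": "B", "prediction": "B"},
]
per_category, overall = accuracy_by_category(example_records)
```

Exact matching on a fixed option set is what makes this kind of scoring deterministic: the same responses always yield the same numbers, with no judge model involved. Extracting the option letter from free-form output, especially under chain-of-thought prompting, is a separate step that this sketch does not cover.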
