FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs

Xiaoqin Wang, Xusen Ma, Xianxu Hou, Meidan Ding, Yudong Li, Junliang Chen, Wenting Chen, Xiaoyang Peng, Linlin Shen

TL;DR

FaceBench addresses the lack of comprehensive face-perception benchmarks for multimodal LLMs by proposing a hierarchical, multi-view, multi-level facial attribute VQA dataset. It organizes attributes into five views with up to three levels each, yielding 211 attributes and 701 attribute values, and comprises 15,842 images and 73,760 VQA pairs built from 194 question templates. The authors train Face-LLaVA, a specialized baseline, on FaceBench's training data and show that current open-source MLLMs underperform on fine-grained facial attributes, while Face-LLaVA achieves results approaching those of commercial models in several views. The dataset and baseline provide a foundation for benchmarking and improving the face-perception capabilities of MLLMs.

Abstract

Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in various tasks. However, effectively evaluating these MLLMs on face perception remains largely unexplored. To address this gap, we introduce FaceBench, a dataset featuring hierarchical multi-view and multi-level attributes specifically designed to assess the comprehensive face perception abilities of MLLMs. First, we construct a hierarchical facial attribute structure, which encompasses five views with up to three levels of attributes, totaling over 210 attributes and 700 attribute values. Based on this structure, the proposed FaceBench consists of 49,919 visual question-answering (VQA) pairs for evaluation and 23,841 pairs for fine-tuning. We further develop a robust face perception MLLM baseline, Face-LLaVA, by training on our proposed face VQA data. Extensive experiments on various mainstream MLLMs and Face-LLaVA are conducted to test their face perception ability, with results also compared against human performance. The results reveal that existing MLLMs are far from satisfactory in understanding fine-grained facial attributes, while our Face-LLaVA significantly outperforms existing open-source models with a small amount of training data and is comparable to commercial ones like GPT-4o and Gemini. The dataset will be released at https://github.com/CVI-SZU/FaceBench.
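
As a quick consistency check, the evaluation and fine-tuning counts in the abstract should add up to the 73,760 VQA pairs quoted in the TL;DR. The sketch below does that arithmetic and reports the share of each split; the function name `split_summary` is a hypothetical helper introduced here for illustration and is not part of any FaceBench release.

```python
# Pair counts as stated in the abstract.
EVAL_PAIRS = 49_919   # VQA pairs reserved for evaluation
TRAIN_PAIRS = 23_841  # VQA pairs used for fine-tuning


def split_summary(eval_pairs: int, train_pairs: int) -> dict:
    """Return the total pair count and the fraction held by each split.

    Hypothetical helper for illustration only.
    """
    total = eval_pairs + train_pairs
    return {
        "total": total,
        "eval_fraction": eval_pairs / total,
        "train_fraction": train_pairs / total,
    }


summary = split_summary(EVAL_PAIRS, TRAIN_PAIRS)
print(f"total pairs: {summary['total']}")
print(f"evaluation share: {summary['eval_fraction']:.1%}")
print(f"fine-tuning share: {summary['train_fraction']:.1%}")
```

The total matches the TL;DR figure, and roughly two thirds of the pairs are held out for evaluation, which is consistent with the abstract's remark that Face-LLaVA is trained on a small amount of data.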

Paper Structure

This paper contains 12 sections, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: Overview of FaceBench. Left: Examples of face images, including a face region mask. Center: FaceBench covers multiple views (appearance, accessories, surrounding, identity, psychology). Each view contains multi-level attributes (Level 1, Level 2, Level 3), comprising over 210 attributes and 700 attribute values in total. Right: Q&A with our Face-LLaVA, fine-tuned on FaceBench. Best viewed in color.
  • Figure 2: Hierarchical organization of facial attributes. We categorize facial attributes into Appearance, Identity, Surrounding, Accessories, and Psychology, illustrating their hierarchical structure across three levels. Best viewed in color.
  • Figure 3: Question types and human annotation workflow for building our dataset. Best viewed in color.
  • Figure 4: Samples from our FaceBench dataset. The figure shows a range of VQA pairs designed to evaluate the perception of facial attributes, categorized into Appearance, Accessories, Surrounding, Identity, and Psychology.