Face-MLLM: A Large Face Perception Model

Haomiao Sun, Mingjie He, Tianheng Lian, Hu Han, Shiguang Shan

TL;DR

This work comprehensively evaluates existing MLLMs on face perception tasks and develops a novel multimodal large face perception model, Face-MLLM, which surpasses previous MLLMs on five well-known face perception tasks.

Abstract

Although multimodal large language models (MLLMs) have achieved promising results on a wide range of vision-language tasks, their ability to perceive and understand human faces has rarely been explored. In this work, we comprehensively evaluate existing MLLMs on face perception tasks. The quantitative results reveal that existing MLLMs struggle to handle these tasks. The primary reason is the lack of image-text datasets that contain fine-grained descriptions of human faces. To tackle this problem, we design a practical pipeline for constructing datasets, upon which we further build a novel multimodal large face perception model, namely Face-MLLM. Specifically, we re-annotate the LAION-Face dataset with more detailed face captions and facial attribute labels. In addition, we re-formulate traditional face datasets in a question-answer style that suits MLLMs. Together with these enriched datasets, we develop a novel three-stage MLLM training method. In the first two stages, our model learns visual-text alignment and basic visual question answering capability, respectively. In the third stage, our model learns to handle multiple specialized face perception tasks. Experimental results show that our model surpasses previous MLLMs on five well-known face perception tasks. Moreover, on our newly introduced zero-shot facial attribute analysis task, Face-MLLM also achieves superior performance.
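
The question-answer reformulation mentioned in the abstract can be illustrated with a minimal sketch. The record fields, the question templates, and the `to_qa_pairs` helper are hypothetical and are not taken from the paper; they only show how a conventionally labeled face record might be turned into QA-style samples.

```python
from typing import Dict, List

# Hypothetical question templates; the paper's actual prompts are not given here.
TEMPLATES: Dict[str, str] = {
    "age": "How old is the person in the image?",
    "gender": "What is the gender of the person in the image?",
    "expression": "What facial expression does the person show?",
}


def to_qa_pairs(record: Dict[str, str]) -> List[Dict[str, str]]:
    """Turn one labeled face record into question-answer samples.

    Only labels present in the record produce a sample, so datasets
    that annotate different attributes can share the same templates.
    """
    pairs = []
    for field, question in TEMPLATES.items():
        if field in record:
            pairs.append(
                {
                    "image": record["image"],
                    "question": question,
                    "answer": str(record[field]),
                }
            )
    return pairs


# Illustrative record with made-up values.
sample = {"image": "face_0001.jpg", "age": "34", "expression": "happy"}
qa = to_qa_pairs(sample)
```

Each resulting dictionary pairs an image with one question and its ground-truth answer, which is the general shape that visual question answering training data tends to take.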

Paper Structure

This paper contains 18 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Our Face-MLLM model demonstrates superior performance in both traditional and zero-shot face perception tasks, showcasing its robustness and versatility in handling various face perception challenges.
  • Figure 2: Training paradigm and architecture of Face-MLLM. The left side illustrates our three-stage training strategy, including representative examples of training data for each stage. The right side depicts the model's structural components, alongside an example of the face description task.
  • Figure 3: The prompt for re-annotation of the LAION-Face data. This prompt guides Gemini-1.0-Pro-Vision to perform image captioning and face attribute classification concurrently.
  • Figure 4: The probability (%) that different models can provide properly formatted responses.
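
Figure 4 reports how often each model's responses follow the required format. As a rough sketch of how such a rate might be computed, the snippet below assumes (hypothetically) that a properly formatted response is a single option letter from A to D; the paper's actual format criteria are not stated here, and `format_rate` is an illustrative helper.

```python
import re
from typing import Iterable

# Assumed format: a single option letter A-D, optionally followed by a period.
PATTERN = re.compile(r"[A-D]\.?")


def format_rate(responses: Iterable[str]) -> float:
    """Return the percentage of responses that match the assumed format."""
    responses = list(responses)
    if not responses:
        return 0.0
    ok = sum(1 for r in responses if PATTERN.fullmatch(r.strip()))
    return 100.0 * ok / len(responses)


# Made-up responses: "A" and "D." match, the other two do not.
rate = format_rate(["A", "b", "The answer is C", "D."])
```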