Face-LLaVA: Facial Expression and Attribute Understanding through Instruction Tuning

Ashutosh Chaubey, Xulang Guan, Mohammad Soleymani

TL;DR

Face-LLaVA tackles the challenge of versatile, reasoning-enabled face analysis by combining a face-focused visual encoder with instruction tuning. It introduces FaceInstruct-1M, a large-scale dataset spanning five face tasks on both images and videos, generated with Gemini and refined by GPT-based filtering, and uses a two-stage training scheme to align facial landmarks with language outputs. The architecture employs a Face-Region Landmark Projector (FRLP) and Face-Region Guided Cross-Attention (FRGCA) to encode region-specific landmark information into the LLM context, enabling zero-shot and competitive supervised performance across nine benchmarks and five tasks, with GPT-based evaluation rating its reasoning above that of the baselines. The work offers a practical step toward general-purpose social AI and provides resources (dataset and model) to support further research in vision-language understanding of faces, while acknowledging ethical considerations and limitations.

Abstract

The human face plays a central role in social communication, necessitating the use of performant computer vision tools for human-centered applications. We propose Face-LLaVA, a multimodal large language model for face-centered, in-context learning, including facial expression and attribute recognition. Additionally, Face-LLaVA is able to generate natural language descriptions that can be used for reasoning. Leveraging existing visual databases, we first developed FaceInstruct-1M, a face-centered database for instruction tuning MLLMs for face processing. We then developed a novel face-specific visual encoder powered by Face-Region Guided Cross-Attention that integrates face geometry with local visual features. We evaluated the proposed method across nine different datasets and five different face processing tasks, including facial expression recognition, action unit detection, facial attribute detection, age estimation and deepfake detection. Face-LLaVA achieves superior results compared to existing open-source MLLMs and competitive performance compared to commercial solutions. Our model output also receives a higher reasoning rating by GPT under a zero-shot setting across all the tasks. Both our dataset and model will be released at https://face-llava.github.io to support future advancements in social AI and foundational vision-language research.

Paper Structure

This paper contains 48 sections, 7 equations, 27 figures, 12 tables.

Figures (27)

  • Figure 1: A sample conversation with Face-LLaVA highlighting different face tasks that our model is capable of performing.
  • Figure 2: FaceInstruct-1M dataset samples for different tasks.
  • Figure 3: Proposed Face-LLaVA architecture. We group landmark points into different face regions and project them through a face-region landmark projector. The visual tokens are enriched by the landmark tokens through cross-attention and then passed as input to the LLM.
  • Figure 4: Comparison of Face-LLaVA with other MLLM baselines on the task of facial expression recognition. Red text indicates misaligned text and blue text indicates aligned/correct text according to the video and ground truth label.
  • Figure 5: Data annotation pipeline used for creating FaceInstruct-1M dataset.
  • ...and 22 more figures
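
The Figure 3 caption describes three steps: landmarks are grouped into face regions, each region is projected to a token, and visual tokens are enriched by those landmark tokens through cross-attention. The NumPy sketch below illustrates that data flow only. It is not the authors' implementation: the 68-point landmark layout, the region index ranges, the random weights, the single-head attention, and the residual connection are all assumptions made for illustration, and the names `project_regions` and `cross_attend` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def project_regions(landmarks, regions, weights):
    """Flatten each region's (x, y) points and map them to one d-dim token."""
    tokens = []
    for name, idx in regions.items():
        flat = landmarks[idx].reshape(-1)       # (2 * points_in_region,)
        tokens.append(flat @ weights[name])     # (d,)
    return np.stack(tokens)                     # (num_regions, d)

def cross_attend(visual, region_tokens, wq, wk, wv):
    """Visual tokens act as queries; region landmark tokens act as keys/values."""
    q = visual @ wq
    k = region_tokens @ wk
    v = region_tokens @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    # Residual connection (an assumption): add landmark context to each visual token.
    return visual + attn @ v, attn

rng = np.random.default_rng(0)
d = 16                                          # toy embedding width
landmarks = rng.random((68, 2))                 # assumed 68-point layout, normalized coords
regions = {                                     # assumed grouping, for illustration only
    "jaw": np.arange(0, 17),
    "brows": np.arange(17, 27),
    "nose": np.arange(27, 36),
    "eyes": np.arange(36, 48),
    "mouth": np.arange(48, 68),
}
weights = {name: rng.normal(size=(2 * len(idx), d)) * 0.1
           for name, idx in regions.items()}
wq, wk, wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

visual = rng.normal(size=(9, d))                # e.g. a 3x3 grid of patch tokens
region_tokens = project_regions(landmarks, regions, weights)
enriched, attn = cross_attend(visual, region_tokens, wq, wk, wv)
print(region_tokens.shape, enriched.shape, attn.shape)
```

Each row of `attn` shows how strongly one visual token draws on each face region. The enriched tokens keep the shape of the visual tokens, so in a design like this they could be passed to the language model in place of the originals.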