Towards Unified Facial Action Unit Recognition Framework by Large Language Models

Guohong Hu, Xing Lan, Hanyu Jiang, Jiayi Lyu, Jian Xue

TL;DR

AU-LLaVA is a unified AU recognition framework that uses large language model reasoning to process visual AU cues through a visual encoder, a linear projector, and an instruction-tuned LLM. The model outputs a multi-format AU vector $\boldsymbol{A}$, with each element $A^i$ computed as $A^i = \boldsymbol{\Phi}_L\big(\boldsymbol{\Phi}_P(\ldots$

Abstract

Facial Action Units (AUs) are of great significance in the realm of affective computing. In this paper, we propose AU-LLaVA, the first unified AU recognition framework based on the Large Language Model (LLM). AU-LLaVA consists of a visual encoder, a linear projector layer, and a pre-trained LLM. We meticulously craft the text descriptions and fine-tune the model on various AU datasets, allowing it to generate different formats of AU recognition results for the same input image. On the BP4D and DISFA datasets, AU-LLaVA delivers the most accurate recognition results for nearly half of the AUs. Our model achieves improvements of F1-score up to 11.4% in specific AU recognition compared to previous benchmark results. On the FEAFA dataset, our method achieves significant improvements over all 24 AUs compared to previous benchmark results. AU-LLaVA demonstrates exceptional performance and versatility in AU recognition.
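
The abstract reports results as per-AU F1-scores. As a reminder of what that metric measures, here is a minimal sketch that computes one F1 value per AU from binary labels. The function name `per_au_f1` and the row-per-image, column-per-AU layout are assumptions made for illustration; the paper's exact evaluation protocol is not described in this section.

```python
from typing import List, Sequence


def per_au_f1(
    y_true: Sequence[Sequence[int]],
    y_pred: Sequence[Sequence[int]],
) -> List[float]:
    """Return one F1 score per AU column for binary (0/1) labels.

    Each row is one image; each column is one AU (assumed layout).
    """
    num_aus = len(y_true[0])
    scores = []
    for au in range(num_aus):
        tp = fp = fn = 0
        for true_row, pred_row in zip(y_true, y_pred):
            t, p = true_row[au], pred_row[au]
            if t == 1 and p == 1:
                tp += 1  # AU present and detected
            elif t == 0 and p == 1:
                fp += 1  # AU absent but detected
            elif t == 1 and p == 0:
                fn += 1  # AU present but missed
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the AU never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


if __name__ == "__main__":
    truth = [[1, 0], [1, 1], [0, 1], [0, 0]]
    preds = [[1, 0], [0, 1], [0, 1], [1, 0]]
    print(per_au_f1(truth, preds))
```

Scoring each AU separately is what makes a claim such as "most accurate for nearly half of the AUs" checkable, since a single averaged number would hide which AUs improved.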

Paper Structure (14 sections, 3 equations, 2 figures, 4 tables)

This paper contains 14 sections, 3 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Versatility of AU-LLaVA results. For a single input image, the model demonstrates multi-modal capabilities: (a) binary AU detection (0 or 1), (b) discrete AU intensity levels (0-5), (c) continuous AU intensity values (0-1). AU-LLaVA is a unified AU recognition framework based on an LLM.
  • Figure 2: Architectural framework of AU-LLaVA, which comprises three primary components: a visual encoder, a linear projector, and a pretrained LLM. AU-LLaVA processes facial images and textual descriptions as inputs, generating an array where each element corresponds to a specific Action Unit. During the training phase, Low-Rank Adaptation (LoRA) modules are integrated into both the visual encoder and the LLM to enhance efficiency.
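
The two captions above describe a model that emits an array of AU values in one of three formats. The sketch below shows how such a text response might be parsed and range-checked downstream. The bracketed response format, the function name `parse_au_response`, and the format labels `"binary"`, `"discrete"`, and `"continuous"` are assumptions made for illustration; only the value ranges come from the Figure 1 caption.

```python
import re
from typing import List


def parse_au_response(text: str, fmt: str, num_aus: int) -> List[float]:
    """Extract a bracketed AU array from generated text and validate it.

    Assumes the response contains one list such as "[0, 1, 0.5]".
    """
    match = re.search(r"\[([^\]]*)\]", text)
    if match is None:
        raise ValueError("no bracketed array found in response")

    # float() raises ValueError on any non-numeric token
    values = [float(tok) for tok in match.group(1).split(",") if tok.strip()]
    if len(values) != num_aus:
        raise ValueError(f"expected {num_aus} values, got {len(values)}")

    for v in values:
        if fmt == "binary":        # (a) detection: 0 or 1
            ok = v in (0.0, 1.0)
        elif fmt == "discrete":    # (b) intensity levels: integers 0-5
            ok = v.is_integer() and 0 <= v <= 5
        elif fmt == "continuous":  # (c) intensity values: reals in [0, 1]
            ok = 0.0 <= v <= 1.0
        else:
            raise ValueError(f"unknown format: {fmt}")
        if not ok:
            raise ValueError(f"value {v} is out of range for format {fmt}")
    return values


if __name__ == "__main__":
    print(parse_au_response("[0, 1, 1, 0]", "binary", 4))
    print(parse_au_response("Result: [0, 3, 5]", "discrete", 3))
    print(parse_au_response("[0.25, 0.0, 1.0]", "continuous", 3))
```

Validating against the expected format matters because a generative model returns free text, so a malformed or out-of-range array should be rejected rather than scored.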