Towards Unified Facial Action Unit Recognition Framework by Large Language Models

Guohong Hu; Xing Lan; Hanyu Jiang; Jiayi Lyu; Jian Xue

TL;DR

AU-LLaVA presents a unified AU recognition framework that leverages large language model reasoning to process visual AU cues via a visual encoder, a linear projector, and an instruction-tuned LLM. The model outputs a multi-format AU vector $\boldsymbol{A}$, with each element $A^i$ computed as $A^i = \boldsymbol{\Phi}_L\big(\boldsymbol{\Phi}_P(\ldots)\big)$

Abstract

Facial Action Units (AUs) are of great significance in the realm of affective computing. In this paper, we propose AU-LLaVA, the first unified AU recognition framework based on the Large Language Model (LLM). AU-LLaVA consists of a visual encoder, a linear projector layer, and a pre-trained LLM. We meticulously craft the text descriptions and fine-tune the model on various AU datasets, allowing it to generate different formats of AU recognition results for the same input image. On the BP4D and DISFA datasets, AU-LLaVA delivers the most accurate recognition results for nearly half of the AUs. Our model achieves improvements of F1-score up to 11.4% in specific AU recognition compared to previous benchmark results. On the FEAFA dataset, our method achieves significant improvements over all 24 AUs compared to previous benchmark results. AU-LLaVA demonstrates exceptional performance and versatility in AU recognition.
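The abstract says the model can produce different formats of AU results for the same image. As a minimal sketch of how such outputs might be consumed downstream, the snippet below parses a bracketed array and checks it against one of the three formats listed in Figure 1. The function name `parse_au_array`, the mode names, and the assumption that the LLM emits a Python-style list such as `[0, 1, 0]` are all hypothetical and are not taken from the paper.

```python
import ast
from typing import List

# Hypothetical mode names; the value ranges follow the three formats in Figure 1.
RANGES = {
    "binary": (0, 1),          # AU detection: absent (0) or present (1)
    "discrete": (0, 5),        # discrete AU intensity levels
    "continuous": (0.0, 1.0),  # continuous AU intensity values
}


def parse_au_array(text: str, mode: str) -> List[float]:
    """Parse a bracketed list of AU values and validate it against `mode`.

    Assumes (hypothetically) that the generated text is a plain list
    literal, one value per Action Unit.
    """
    low, high = RANGES[mode]
    values = ast.literal_eval(text.strip())
    if not isinstance(values, list):
        raise ValueError("expected a list of AU values")
    parsed = []
    for v in values:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            raise ValueError(f"non-numeric AU value: {v!r}")
        # Binary and discrete formats only allow integer values.
        if mode != "continuous" and not isinstance(v, int):
            raise ValueError(f"{mode} format requires integers, got {v!r}")
        if not low <= v <= high:
            raise ValueError(f"{v!r} is outside [{low}, {high}] for {mode}")
        parsed.append(float(v))
    return parsed
```

For example, `parse_au_array("[0, 3, 5]", "discrete")` accepts the input, while `parse_au_array("[2]", "binary")` raises `ValueError` because 2 is not a valid detection label.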

Paper Structure (14 sections, 3 equations, 2 figures, 4 tables)

Introduction
Related Work
AU Recognition
Large Language Model on Vision Tasks
The Proposed Method
Overview
Text Description
LoRA Modules
Experiments
Datasets
Implementation Details
Ablation Study
Comparison with state-of-the-art Methods
Conclusion

Figures (2)

Figure 1: Versatility of AU-LLaVA results. For a single input image, the model demonstrates multi-modal capabilities: (a) binary AU detection (0 or 1), (b) discrete AU intensity levels (0-5), (c) continuous AU intensity values (0-1). AU-LLaVA is a unified AU recognition framework based on LLM.
Figure 2: Architectural framework of AU-LLaVA, which comprises three primary components: a visual encoder, a linear projector, and a pretrained LLM. AU-LLaVA processes facial images and textual descriptions as inputs, generating an array where each element corresponds to a specific Action Unit. During the training phase, Low-Rank Adaptation (LoRA) modules are integrated into both the visual encoder and the LLM to enhance efficiency.
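The Figure 2 caption mentions LoRA modules in the visual encoder and the LLM. The sketch below illustrates the general LoRA idea on a single linear layer, using the standard formulation in which a frozen weight is augmented by a scaled low-rank product. It is not the authors' implementation: the function names, the row-vector convention, and the tiny matrices are assumptions chosen for illustration, and pure-Python lists stand in for tensors.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x: Matrix, w: Matrix, a: Matrix, b: Matrix, alpha: float) -> Matrix:
    """Compute x @ W + (alpha / r) * (x @ A) @ B.

    Row-vector convention (an assumption of this sketch):
      x: (n, d_in), W: (d_in, d_out) frozen base weight,
      A: (d_in, r) and B: (r, d_out) trainable low-rank factors.
    """
    r = len(b)  # rank = number of rows of B
    base = matmul(x, w)
    update = matmul(matmul(x, a), b)
    scale = alpha / r
    return [
        [p + scale * q for p, q in zip(base_row, update_row)]
        for base_row, update_row in zip(base, update)
    ]
```

With `B` initialised to zeros, the adapted layer reproduces the frozen layer exactly, so training starts from the pretrained behaviour and only the small factors `A` and `B` need gradients.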
