Facial Affective Behavior Analysis with Instruction Tuning

Yifan Li, Anh Dao, Wentao Bao, Zhen Tan, Tianlong Chen, Huan Liu, Yu Kong

TL;DR

The paper tackles the challenge of enabling fine-grained, interpretable facial affective behavior analysis (FABA) with instruction-tuned multi-modal large language models (MLLMs). It introduces FABA-Instruct, an instruction-following dataset, and FABA-Bench, a benchmark with a unified REGE metric that blends recognition and generation. It then presents EmoLA, an instruction-tuned MLLM augmented with a facial-prior expert and LoRA-based adapters, achieving state-of-the-art or competitive results across traditional FER/AUR datasets and strong performance on FABA-Bench with high data efficiency. The work demonstrates that incorporating facial priors and instruction tuning unlocks rich, explainable descriptions of facial affect and sets a foundation for expanding FABA to additional tasks and modalities.

Abstract

Facial affective behavior analysis (FABA) is crucial for understanding human mental states from images. However, traditional approaches primarily deploy models to discriminate among discrete emotion categories, and lack the fine granularity and reasoning capability needed for complex facial behaviors. Multi-modal Large Language Models (MLLMs) have proven successful in general visual understanding tasks. Yet directly harnessing MLLMs for FABA is challenging due to the scarcity of datasets and benchmarks, the neglect of facial prior knowledge, and low training efficiency. To address these challenges, we introduce (i) an instruction-following dataset for two FABA tasks, i.e., emotion and action unit recognition, (ii) a benchmark, FABA-Bench, with a new metric considering both recognition and generation ability, and (iii) a new MLLM, "EmoLA", as a strong baseline for the community. Our initiative on the dataset and benchmark reveals the nature and rationale of facial affective behaviors, i.e., fine-grained facial movement, interpretability, and reasoning. Moreover, to build an effective and efficient FABA MLLM, we introduce a facial prior expert module with face structure knowledge and a low-rank adaptation module into a pre-trained MLLM. We conduct extensive experiments on FABA-Bench and four commonly used FABA datasets. The results demonstrate that the proposed facial prior expert boosts performance and that EmoLA achieves the best results on our FABA-Bench. On commonly used FABA datasets, EmoLA is competitive with, and in some cases rivals, task-specific state-of-the-art models.
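
The low-rank adaptation module mentioned in the abstract can be illustrated with a minimal sketch of the generic LoRA idea: a frozen weight matrix W is augmented with a trainable low-rank update scaled by alpha / rank, so the effective weight is W + (alpha / rank) * B A. This is not EmoLA's actual implementation. The class name `LoRALinear`, the pure-Python matrix helper, and the example rank and alpha values are assumptions made only for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Hypothetical linear layer with a frozen weight and a low-rank update."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        self.out_dim, self.in_dim = len(weight), len(weight[0])
        self.rank = rank
        self.weight = weight          # frozen pre-trained weight, out_dim x in_dim
        self.scale = alpha / rank     # standard LoRA scaling factor
        # A starts with small random values, B starts at zero, so the
        # adapted layer initially behaves exactly like the frozen one.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(self.in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(self.out_dim)]

    def effective_weight(self):
        """Return W + scale * (B @ A)."""
        delta = matmul(self.B, self.A)
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x):
        """Apply the adapted layer to one input vector x of length in_dim."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def num_trainable(self):
        """Only A and B are trained: rank * (in_dim + out_dim) values."""
        return self.rank * (self.in_dim + self.out_dim)


W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
layer = LoRALinear(W, rank=1, alpha=2.0)
x = [1.0, 2.0, 3.0]
y_before = layer.forward(x)   # identical to the frozen layer, since B is zero

# Simulate a training update to the low-rank factors only.
layer.A = [[0.5, 0.0, 0.0]]
layer.B = [[1.0], [0.0]]
y_after = layer.forward(x)
```

In this toy case the frozen matrix holds 6 values while the adapter trains only 5; the saving grows quickly for realistic layer sizes, which is the usual motivation for low-rank adaptation when training efficiency matters.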

Paper Structure (29 sections, 2 equations, 14 figures, 18 tables)

Figures (14)

  • Figure 1: An illustration of FABA-Instruct annotations. FABA-Instruct provides fine-grained emotion and AU descriptions, which include not only the reasoning process about the facial movements but also the inference of the emotion. Furthermore, compared to traditional category labels, FABA-Instruct offers richer expressions for describing complex, nuanced, exaggerated, and undefined affective behaviors.
  • Figure 2: FABA-Instruct annotation.
  • Figure 3: Emotion description analysis. Emotion descriptions can express a comprehensive range of emotion types, such as compositional emotions, exaggerated emotions, degrees of emotion, and undefined emotions. In contrast, emotion categories struggle to capture such complex and nuanced emotional states.
  • Figure 4: AU description analysis. AU descriptions not only give the AU labels but also explain, for each AU, the cause (which muscle movement) and the effect (which emotion it leads to), as well as the relationship between the current AU and other AUs or emotions.
  • Figure 5: The synonyms of emotions for classifying the text.
  • ...and 9 more figures