EMO-LLaMA: Enhancing Facial Emotion Understanding with Instruction Tuning

Bohao Xing, Zitong Yu, Xin Liu, Kaishen Yuan, Qilang Ye, Weicheng Xie, Huanjing Yue, Jingyu Yang, Heikki Kälviäinen

TL;DR

This work tackles generalization gaps and the lack of semantic alignment in facial expression recognition (FER) by introducing EMO-LLaMA, a multimodal large language model (MLLM) augmented with facial priors. It combines a Face Info Mining module, facial priors (face embeddings, landmarks, and age-gender-race (AGR) attributes), and a Q-Former-based Clue Aggregator that converts visual cues into tokens the LLM can consume; the model is built on LLaMA-VID and tuned with LoRA. A Gemini-driven pipeline generates a large FER instruction dataset covering both image and video modalities for instruction tuning. Across six FER datasets and cross-modality tests, EMO-LLaMA achieves results comparable to the state of the art (SOTA) and generalizes well, which points to the potential of instruction-tuned MLLMs for unified multimodal emotion understanding and for future extensions to speech and audio modalities.

Abstract

Facial expression recognition (FER) is an important research topic in emotional artificial intelligence. In recent decades, researchers have made remarkable progress. However, current FER paradigms face challenges in generalization, lack semantic information aligned with natural language, and struggle to process both images and videos within a unified framework, making their application in multimodal emotion understanding and human-computer interaction difficult. Multimodal Large Language Models (MLLMs) have recently achieved success, offering advantages in addressing these issues and potentially overcoming the limitations of current FER paradigms. However, directly applying pre-trained MLLMs to FER still faces several challenges. Our zero-shot evaluations of existing open-source MLLMs on FER indicate a significant performance gap compared to GPT-4V and current supervised state-of-the-art (SOTA) methods. In this paper, we aim to enhance MLLMs' capabilities in understanding facial expressions. We first generate instruction data for five FER datasets with Gemini. We then propose a novel MLLM, named EMO-LLaMA, which incorporates facial priors from a pretrained facial analysis network to enhance human facial information. Specifically, we design a Face Info Mining module to extract both global and local facial information. Additionally, we utilize a handcrafted prompt to introduce age-gender-race attributes, considering the emotional differences across different human groups. Extensive experiments show that EMO-LLaMA achieves SOTA-comparable or competitive results across both static and dynamic FER datasets. The instruction dataset and code are available at https://github.com/xxtars/EMO-LLaMA.
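
The handcrafted age-gender-race prompt mentioned in the abstract can be pictured as a small text template placed in front of the instruction. The sketch below is only an illustration: the `FaceAttributes` container, the `build_agr_prompt` function, the template wording, and the attribute values are all hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass
class FaceAttributes:
    """Hypothetical container for age-gender-race (AGR) predictions."""
    age: int
    gender: str
    race: str


def build_agr_prompt(attrs: FaceAttributes, question: str) -> str:
    """Prepend a handcrafted AGR description to an instruction.

    The template wording is an assumption, not the paper's actual prompt.
    """
    evidence = (
        f"Facial evidence: the person appears to be about {attrs.age} years old, "
        f"gender: {attrs.gender}, race: {attrs.race}."
    )
    return f"{evidence}\n{question}"


# Placeholder attribute values; in practice they would come from a
# pretrained facial analysis network.
prompt = build_agr_prompt(
    FaceAttributes(age=30, gender="female", race="Asian"),
    "What is the facial expression of the person in the image?",
)
print(prompt)
```

Keeping these attributes as plain text means they reach the LLM through its ordinary tokenizer, so this part of the input needs no extra trainable parameters.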

Paper Structure

This paper contains 10 sections, 6 equations, 4 figures, and 9 tables.

Figures (4)

  • Figure 1: Comparison of zero-shot MLLM results and previous supervised SOTA on several FER datasets. Blue denotes image datasets and purple denotes video datasets.
  • Figure 2: The framework of EMO-LLaMA. The model first obtains a face image from a face detection network and feeds it into a facial analysis expert to extract facial prior knowledge. A Clue Aggregator extracts task-specific embeddings. General visual features are enhanced with facial features in the Face Info Mining module. In addition, landmark embeddings and handcrafted prompts of facial evidence further enrich the face information given as input to the LLM.
  • Figure 3: Face Info Mining Module.
  • Figure 4: Examples of our generated instruction data. Zoom in for a closer look; more examples can be found in the Appendix.
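
The Clue Aggregator in Figure 2 is described as Q-Former-based, and the core idea of a Q-Former is that a fixed set of query vectors cross-attends over the visual features. The sketch below shows only that one step, under strong simplifying assumptions: a single attention head, no key/value projections, no self-attention or feed-forward layers, and random numbers in place of learned queries and real features. The name `aggregate_clues` and all sizes are hypothetical and do not come from the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def aggregate_clues(queries, features):
    """Single-head cross-attention: each query attends over all features.

    queries:  list of Q vectors of dimension d (learned in a real model)
    features: list of N vectors of dimension d (N may vary per input)
    returns:  list of Q vectors of dimension d
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    outputs = []
    for q in queries:
        weights = softmax([dot(q, f) * scale for f in features])
        pooled = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(d)]
        outputs.append(pooled)
    return outputs


rng = random.Random(0)
dim, num_queries = 8, 4  # illustrative sizes only
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]
image_feats = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(49)]
video_feats = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(16 * 49)]

image_tokens = aggregate_clues(queries, image_feats)
video_tokens = aggregate_clues(queries, video_feats)
print(len(image_tokens), len(video_tokens))  # prints: 4 4
```

The number of output tokens equals the number of queries, whether the input is 49 patch features or 784 frame-patch features. A query-based aggregator can therefore hand a fixed-size summary to the LLM for both images and videos.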