ExpLLM: Towards Chain of Thought for Facial Expression Recognition

Xing Lan, Jian Xue, Ji Qi, Dongmei Jiang, Ke Lu, Tat-Seng Chua

TL;DR

A novel method called ExpLLM is proposed, which leverages large language models to generate an accurate chain of thought (CoT) for facial expression recognition, and outperforms current state-of-the-art FER methods.

Abstract

Facial expression recognition (FER) is a critical task in multimedia with significant implications across various domains. Recognizing expressions accurately, however, requires analyzing their causes. Current approaches, such as those based on facial action units (AUs), typically provide AU names and intensities but lack insight into the interactions and relationships between AUs and the overall expression. In this paper, we propose a novel method called ExpLLM, which leverages large language models to generate an accurate chain of thought (CoT) for facial expression recognition. Specifically, we design the CoT mechanism from three key perspectives: key observations, overall emotional interpretation, and conclusion. The key observations describe each AU's name, intensity, and associated emotions. The overall emotional interpretation provides an analysis based on multiple AUs and their interactions, identifying the dominant emotions and their relationships. Finally, the conclusion presents the final expression label derived from the preceding analysis. We also introduce the Exp-CoT Engine, designed to construct this expression CoT and generate instruction-description data for training ExpLLM. Extensive experiments on the RAF-DB and AffectNet datasets demonstrate that ExpLLM outperforms current state-of-the-art FER methods. ExpLLM also surpasses the latest GPT-4o in expression CoT generation, particularly in recognizing micro-expressions, where GPT-4o frequently fails.
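The three-part CoT described in the abstract can be pictured as a small data structure. The sketch below is illustrative only: the class names, field names, intensity wording, and text layout are assumptions made for this example, not the format used by the authors.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KeyObservation:
    """One facial action unit (AU) with its intensity and associated emotions."""
    au_name: str          # e.g. "Cheek Raiser"
    intensity: str        # e.g. "moderate"; the intensity scale is an assumption
    emotions: List[str]   # emotions this AU may indicate


@dataclass
class ExpressionCoT:
    """Hypothetical container for the three CoT parts named in the abstract."""
    key_observations: List[KeyObservation]
    interpretation: str   # analysis across several AUs and their interactions
    conclusion: str       # final expression label

    def render(self) -> str:
        """Serialize the CoT as plain text, one section per perspective."""
        lines = ["Key observations:"]
        for obs in self.key_observations:
            emotions = ", ".join(obs.emotions)
            lines.append(f"- {obs.au_name} ({obs.intensity}): {emotions}")
        lines.append(f"Overall emotional interpretation: {self.interpretation}")
        lines.append(f"Conclusion: {self.conclusion}")
        return "\n".join(lines)


cot = ExpressionCoT(
    key_observations=[
        KeyObservation("Cheek Raiser", "moderate", ["happiness"]),
        KeyObservation("Lip Corner Puller", "strong", ["happiness"]),
    ],
    interpretation="Both AUs point to happiness and reinforce each other.",
    conclusion="happiness",
)
print(cot.render())
```

Keeping the conclusion as a separate field means the final label can be compared against a dataset annotation without parsing the free-text interpretation.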

Paper Structure

This paper contains 35 sections, 1 equation, 5 figures, 5 tables.

Figures (5)

  • Figure 1: The overall structure of the proposed methods. The upper part illustrates the ExpLLM. By inputting both the instruction and the facial image into the ExpLLM, it sequentially generates the Expression CoT. Blue text represents emotional expressions, whereas red text denotes facial action units and their corresponding degrees. The lower part features the Exp-CoT Engine, which is designed to construct the CoT from three perspectives: key observations, overall emotional interpretation, and conclusion. The Exp-CoT Engine utilizes the AU results to generate the expression CoT.
  • Figure 2: Overview of the Exp-CoT Engine. The engine comprises six processes: Img2AU Regression, AU2Des Generation, Des2Exp Generation, Result Verification, Feedback Reflection, and Format Refinement. It uses advanced models, including the AU model and GPT-4o, to convert facial images into action units and then generate a detailed CoT of the facial expression. The red text represents the AU name and its corresponding intensity, while the blue text denotes the potential emotions.
  • Figure 3: A flowchart illustrating the ExpLLM methodology for analyzing facial expressions in images. The process begins with a visual encoder that utilizes LoRA to extract facial features from the image. The visual tokens are then projected and processed through an embedding and tokenizer before being fed into a large language model. The model can generate expression labels or detailed descriptions based on the analysis of the facial expression depicted in the image.
  • Figure 4: Qualitative comparison between ExpLLM and GPT-4o. First column: face image and ground truth expression label. Second column: CoT generated by the Exp-CoT Engine. Third column: CoT generated by ExpLLM. Fourth column: CoT generated by GPT-4o. The blue expression state is consistent with the ground truth, while the red expression state is incorrect.
  • Figure 5: Visualization of ExpLLM-generated samples across eight different expressions. Left: the face image and expression label. Right: the generated CoT. The red text indicates the AU name and its corresponding intensity, while the blue text represents the associated potential expression.
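
The six processes listed in the Figure 2 caption can be read as a pipeline with a verify-and-retry loop. The sketch below is a rough guess at that control flow, not the authors' implementation: the function names, their signatures, the comparison against an annotated label, and the retry limit are all assumptions, and the AU model and LLM calls are replaced by toy stand-ins.

```python
from typing import Callable, Dict, Optional, Tuple

AUScores = Dict[str, float]  # hypothetical: AU name -> intensity


def run_exp_cot_engine(
    image_path: str,
    true_label: str,
    img2au: Callable[[str], AUScores],
    au2des: Callable[[AUScores], str],
    des2exp: Callable[[str, str], Tuple[str, str]],
    max_rounds: int = 3,
) -> Optional[str]:
    """Chain the six stages; return the formatted CoT, or None if unverified."""
    au_scores = img2au(image_path)      # 1. Img2AU Regression
    observations = au2des(au_scores)    # 2. AU2Des Generation
    feedback = ""
    for _ in range(max_rounds):
        # 3. Des2Exp Generation: interpretation plus a predicted label
        interpretation, label = des2exp(observations, feedback)
        if label == true_label:         # 4. Result Verification (assumed check)
            # 6. Format Refinement: fixed three-part layout
            return (
                f"Key observations: {observations}\n"
                f"Overall emotional interpretation: {interpretation}\n"
                f"Conclusion: {label}"
            )
        # 5. Feedback Reflection: tell the generator what went wrong
        feedback = f"Predicted '{label}', but the annotated label is '{true_label}'."
    return None


# Toy stand-ins for the AU model and the LLM calls (illustrative only).
def fake_img2au(image_path: str) -> AUScores:
    return {"Lip Corner Puller": 0.8}


def fake_au2des(au_scores: AUScores) -> str:
    return "; ".join(
        f"{name} (intensity {score:.1f})" for name, score in au_scores.items()
    )


def fake_des2exp(observations: str, feedback: str) -> Tuple[str, str]:
    # Wrong on the first attempt, corrected once feedback is supplied.
    if not feedback:
        return "The raised lip corners are read as tension.", "sadness"
    return "The raised lip corners indicate a smile.", "happiness"


result = run_exp_cot_engine(
    "face.jpg", "happiness", fake_img2au, fake_au2des, fake_des2exp
)
print(result)
```

Passing the stages in as callables keeps the control flow testable without a real AU model or LLM. Returning `None` when verification never succeeds is one possible policy for discarding samples whose CoT cannot be reconciled with the label.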