Facial-R1: Aligning Reasoning and Recognition for Facial Emotion Analysis

Jiulong Wu, Yucheng Shen, Lingyong Yan, Haixin Sun, Deguo Xia, Jizhou Huang, Min Cao

TL;DR

Facial-R1 addresses fundamental interpretability gaps in Facial Emotion Analysis by decomposing emotion understanding into AU recognition and AU-based reasoning while ensuring alignment with the final emotion. It introduces a three-stage training pipeline (instruction-tuned supervised fine-tuning, verifiable-reward reinforcement learning grounded in AUs and emotion labels, and iterative data synthesis to scale training) along with the large-scale FEA-20K benchmark. Across eight benchmarks, Facial-R1 delivers state-of-the-art or competitive performance on AU recognition, emotion recognition, and emotion reasoning, while improving interpretability through structured reasoning outputs. The approach offers scalable, minimally supervised progress toward trustworthy, explainable affective understanding in real-world settings.

Abstract

Facial Emotion Analysis (FEA) extends traditional facial emotion recognition by incorporating explainable, fine-grained reasoning. The task integrates three subtasks: emotion recognition, facial Action Unit (AU) recognition, and AU-based emotion reasoning to model affective states jointly. While recent approaches leverage Vision-Language Models (VLMs) and achieve promising results, they face two critical limitations: (1) hallucinated reasoning, where VLMs generate plausible but inaccurate explanations due to insufficient emotion-specific knowledge; and (2) misalignment between emotion reasoning and recognition, caused by fragmented connections between observed facial features and final labels. We propose Facial-R1, a three-stage alignment framework that effectively addresses both challenges with minimal supervision. First, we employ instruction fine-tuning to establish basic emotional reasoning capability. Second, we introduce reinforcement training guided by emotion and AU labels as reward signals, which explicitly aligns the generated reasoning process with the predicted emotion. Third, we design a data synthesis pipeline that iteratively leverages the prior stages to expand the training dataset, enabling scalable self-improvement of the model. Built upon this framework, we introduce FEA-20K, a benchmark dataset comprising 17,737 training and 1,688 test samples with fine-grained emotion analysis annotations. Extensive experiments across eight standard benchmarks demonstrate that Facial-R1 achieves state-of-the-art performance in FEA, with strong generalization and robust interpretability.
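
To make the second stage more concrete, the following is a minimal sketch of how AU labels and an emotion label could be combined into a single verifiable reward. It is an illustration only, not the paper's actual reward: the function names `au_f1` and `verifiable_reward`, the use of set-level F1 for AU overlap, and the equal default weighting are all assumptions introduced here.

```python
from typing import Iterable


def au_f1(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Set-level F1 between predicted and gold AU labels (assumed metric)."""
    pred, ref = set(predicted), set(gold)
    if not pred and not ref:
        return 1.0  # nothing to detect and nothing predicted
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


def verifiable_reward(
    pred_aus: Iterable[str],
    gold_aus: Iterable[str],
    pred_emotion: str,
    gold_emotion: str,
    au_weight: float = 0.5,  # hypothetical weighting, not from the paper
) -> float:
    """Weighted sum of AU overlap and exact emotion match, in [0, 1]."""
    same_emotion = pred_emotion.strip().lower() == gold_emotion.strip().lower()
    emotion_score = 1.0 if same_emotion else 0.0
    return au_weight * au_f1(pred_aus, gold_aus) + (1.0 - au_weight) * emotion_score


if __name__ == "__main__":
    reward = verifiable_reward(
        pred_aus=["AU4", "AU9"],
        gold_aus=["AU4", "AU9", "AU10"],
        pred_emotion="Disgust",
        gold_emotion="disgust",
    )
    print(f"reward = {reward:.2f}")
```

Because both terms are computed directly from ground-truth labels, a reward of this kind can be checked without a learned reward model, which is the sense in which the AU and emotion labels act as verifiable signals.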

Paper Structure

This paper contains 46 sections, 8 equations, 3 figures, and 12 tables.

Figures (3)

  • Figure 1: Illustration of the facial emotion analysis task. Unlike traditional facial emotion recognition, which directly outputs a predicted emotion (e.g., disgust), facial emotion analysis decomposes the task into three interrelated sub-tasks: (1) Facial Action Unit (AU) Recognition, where local facial muscle movements (e.g., AU4: slight frown ...) are identified; (2) AU-based Emotion Reasoning, which generates natural language explanations linking the detected AUs to the predicted emotion; and (3) Facial Emotion Recognition, which produces the final emotion label. Together, these results enable explainable and interpretable emotion recognition, bridging the gap between low-level visual cues and high-level affective understanding.
  • Figure 2: The Facial-R1 framework consists of three stages: (1) Supervised fine-tuning (SFT) mitigates hallucinations by establishing basic emotion reasoning capability; (2) Reinforcement Learning (RL) leverages verifiable emotional facts as reward signals to build a reasonable and flexible reasoning process; (3) Data Synthesis iteratively leverages the prior two stages to expand the training dataset, enabling scalable self-improvement of the model.
  • Figure 3: Visualizing the samples generated by Facial-R1 across eight different emotions. On the left are the facial image, emotion label, and AU labels; on the right are the incorrect reasoning from Qwen2.5-VL-7B and our correct reasoning process. The red text indicates misidentified AUs or emotion labels, while the green text represents the correct reasoning.