Rethinking Facial Expression Recognition in the Era of Multimodal Large Language Models: Benchmark, Datasets, and Beyond

Fan Zhang, Haoxuan Li, Shengju Qian, Xin Wang, Zheng Lian, Hao Wu, Zhihong Zhu, Yuan Gao, Qiankun Li, Yefeng Zheng, Zhouchen Lin, Pheng-Ann Heng

TL;DR

This work tackles unifying facial expression recognition (FER) under multimodal large language models (MLLMs) by introducing FerBench, a benchmark built on $11{,}072$ VQA-formatted samples from four FER datasets to evaluate 20 MLLMs. It then proposes two large-scale datasets, UniFer-RLVR-360K for reinforcement learning with verifiable rewards and UniFer-CoT-230K for cold-start supervised fine-tuning, and a two-stage post-training pipeline to create UniFer-7B, a unified and interpretable FER foundation model. UniFer-7B delivers state-of-the-art FER performance on FerBench with $68.84\%$ overall accuracy and provides complete reasoning trajectories, surpassing both open- and closed-source baselines and demonstrating strong interpretability. The work delivers a practical path toward unified, interpretable FER in the MLLM era and suggests extending multimodal reasoning to video and omnichannel affective computing tasks.

Abstract

Multimodal Large Language Models (MLLMs) have revolutionized numerous research fields, including computer vision and affective computing. As a pivotal challenge in this interdisciplinary domain, facial expression recognition (FER) has evolved from separate, domain-specific models to more unified approaches. One promising avenue to unify FER tasks is converting conventional FER datasets into visual question-answering (VQA) formats, enabling the direct application of powerful generalist MLLMs for inference. However, despite the success of cutting-edge MLLMs in various tasks, their performance on FER tasks remains largely unexplored. To address this gap, we provide FERBench, a systematic benchmark that incorporates 20 state-of-the-art MLLMs across four widely used FER datasets. Our results reveal that, while MLLMs exhibit good classification performance, they still face significant limitations in reasoning and interpretability. To this end, we introduce post-training strategies aimed at enhancing the facial expression reasoning capabilities of MLLMs. Specifically, we curate two high-quality, large-scale datasets: UniFER-CoT-230K for cold-start initialization and UniFER-RLVR-360K for reinforcement learning with verifiable rewards (RLVR). Building upon them, we develop a unified and interpretable FER foundation model termed UniFER-7B, which outperforms many open-source and closed-source generalist MLLMs (e.g., Gemini-2.5-Pro and Qwen2.5-VL-72B).
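
To make the VQA conversion and the notion of a verifiable reward concrete, here is a minimal Python sketch. It is not the authors' implementation: the label set `EMOTIONS`, the prompt wording, the `<think>`/`<answer>` tag convention, and the names `VQASample`, `to_vqa`, and `verifiable_reward` are all assumptions made for illustration. The sketch turns a (image, label) pair into a question-answer sample and scores a model response by exact match on the extracted answer, which is one simple way a reward can be checked automatically.

```python
import re
from dataclasses import dataclass

# Hypothetical label set for illustration; each FER dataset defines its own.
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]


@dataclass
class VQASample:
    image_path: str
    question: str
    answer: str


def to_vqa(image_path: str, label: str, choices=EMOTIONS) -> VQASample:
    """Wrap a conventional (image, label) FER sample as a VQA-style sample."""
    if label not in choices:
        raise ValueError(f"unknown label: {label!r}")
    options = ", ".join(choices)
    # Assumed prompt template; the paper's actual wording is not given here.
    question = (
        "What is the facial expression of the person in the image? "
        f"Choose one of: {options}. "
        "Put your reasoning in <think></think> "
        "and your final answer in <answer></answer>."
    )
    return VQASample(image_path=image_path, question=question, answer=label)


# Assumed output convention: the final answer is wrapped in <answer> tags.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)


def verifiable_reward(response: str, sample: VQASample) -> float:
    """Return 1.0 if the extracted answer matches the label, else 0.0."""
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0  # no parseable answer, so nothing can be verified
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == sample.answer.lower() else 0.0
```

Because the reward depends only on the ground-truth label already present in the dataset, it needs no learned reward model; a real pipeline would likely add further checks (for example on output format), which this sketch omits.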

Paper Structure

This paper contains 19 sections, 10 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: An illustration of traditional FER models, general-purpose MLLMs, and our proposed specialized FER model.
  • Figure 2: An overview of our proposed FerBench. We incorporate 11K facial images and 20 cutting-edge MLLMs for open and fair evaluation. The top-performing model (i.e., Gemini-2.5-Flash) only achieves $61.55\%$ accuracy on FerBench.
  • Figure 3: The confusion matrices of 20 evaluated MLLMs across various emotion categories on FerBench.
  • Figure 4: An overview of data curation and post-training pipeline. We curate two large-scale and high-quality datasets, and employ them for two-stage post-training, resulting in a unified and interpretable FER foundation model, UniFer-7B.
  • Figure 5: Task-level comparison (in $\%$) across the baseline model, previous SOTA, and our UniFer-7B.
  • ...and 2 more figures