Rethinking Facial Expression Recognition in the Era of Multimodal Large Language Models: Benchmark, Datasets, and Beyond
Fan Zhang, Haoxuan Li, Shengju Qian, Xin Wang, Zheng Lian, Hao Wu, Zhihong Zhu, Yuan Gao, Qiankun Li, Yefeng Zheng, Zhouchen Lin, Pheng-Ann Heng
TL;DR
This work unifies facial expression recognition (FER) under multimodal large language models (MLLMs) by introducing FERBench, a benchmark of $11{,}072$ VQA-formatted samples drawn from four FER datasets and used to evaluate 20 MLLMs. It then proposes two large-scale datasets, UniFER-CoT-230K for cold-start supervised fine-tuning and UniFER-RLVR-360K for reinforcement learning with verifiable rewards, and a two-stage post-training pipeline that produces UniFER-7B, a unified and interpretable FER foundation model. UniFER-7B achieves state-of-the-art performance on FERBench with $68.84\%$ overall accuracy, surpassing both open- and closed-source baselines, and outputs complete reasoning trajectories that make its predictions interpretable. The work offers a practical path toward unified, interpretable FER in the MLLM era and suggests extending multimodal reasoning to video and omnichannel affective computing tasks.
Abstract
Multimodal Large Language Models (MLLMs) have revolutionized numerous research fields, including computer vision and affective computing. As a pivotal challenge in this interdisciplinary domain, facial expression recognition (FER) has evolved from separate, domain-specific models to more unified approaches. One promising avenue to unify FER tasks is converting conventional FER datasets into visual question-answering (VQA) formats, enabling the direct application of powerful generalist MLLMs for inference. However, despite the success of cutting-edge MLLMs in various tasks, their performance on FER tasks remains largely unexplored. To address this gap, we present FERBench, a systematic benchmark that evaluates 20 state-of-the-art MLLMs across four widely used FER datasets. Our results reveal that, while MLLMs exhibit good classification performance, they still face significant limitations in reasoning and interpretability. To address these limitations, we introduce post-training strategies aimed at enhancing the facial expression reasoning capabilities of MLLMs. Specifically, we curate two high-quality, large-scale datasets: UniFER-CoT-230K for cold-start initialization and UniFER-RLVR-360K for reinforcement learning with verifiable rewards (RLVR). Building upon them, we develop a unified and interpretable FER foundation model termed UniFER-7B, which outperforms many open-source and closed-source generalist MLLMs (e.g., Gemini-2.5-Pro and Qwen2.5-VL-72B).
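To make the two central ideas of the abstract concrete, the following is a minimal sketch, not the authors' implementation, of (i) wrapping a labeled FER image as a VQA sample and (ii) scoring a model response with a rule-based, verifiable reward. The label set, the question wording, the `<answer>` tag convention, and the function names `to_vqa` and `verifiable_reward` are all assumptions made for illustration; the abstract does not specify them.

```python
import re

# Assumption: a seven-class label set of basic expressions, chosen for illustration.
LABELS = ["surprise", "fear", "disgust", "happiness", "sadness", "anger", "neutral"]


def to_vqa(image_path, label):
    """Wrap one (image, label) pair as a closed-set VQA sample (hypothetical format)."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label}")
    question = (
        "What is the facial expression of the person in the image? "
        "Choose one of: " + ", ".join(LABELS) + "."
    )
    return {"image": image_path, "question": question, "answer": label}


def verifiable_reward(response, answer):
    """Return 1.0 if the tagged final answer matches the ground truth, else 0.0.

    Assumption: the model is prompted to put its final answer inside
    <answer>...</answer> tags; a response without the tags earns no reward.
    """
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL | re.IGNORECASE)
    if match is None:
        return 0.0
    predicted = match.group(1).strip().lower().rstrip(".")
    return 1.0 if predicted == answer.lower() else 0.0
```

Because the reward depends only on an exact match against the dataset label, it can be checked automatically without a learned reward model, which is the sense in which such a reward is "verifiable".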
