Compound Expression Recognition via Large Vision-Language Models
Jun Yu, Xilong Lu
TL;DR
This work tackles Compound Expression Recognition (CER) from facial images, where a single face can convey several emotions at once and real-world conditions vary widely. It introduces an LVLM-based method with stage-wise Low-Rank Adaptation (LoRA) fine-tuning: the model first learns basic emotions and then specializes on compound expressions, guided by carefully designed context prompts. LoRA keeps the pre-trained weights frozen and learns a low-rank update $\Delta W = BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times d}$, and $r \ll d$, so each adapted $d \times d$ weight matrix needs about $2dr$ trainable parameters instead of $d^2$. The approach achieves strong results on RAF-DB and Aff-Wild2 and shows robust zero-shot generalization on C-EXPR-DB. Together, stage-wise fine-tuning and context-aware prompting deliver accurate CER at reduced fine-tuning cost, which makes the method a candidate for deployment in resource-constrained settings.
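As a rough illustration of the $2dr$ versus $d^2$ comparison, the sketch below counts trainable parameters for one square weight matrix and builds a tiny low-rank update as the product $BA$. The values `d = 4096` and `r = 8` and all function names are assumptions chosen for illustration; they are not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]


def full_update_params(d: int) -> int:
    """Trainable parameters when a d x d weight matrix is updated directly."""
    return d * d


def lora_update_params(d: int, r: int) -> int:
    """Trainable parameters for a rank-r update: B is d x r, A is r x d."""
    if not 0 < r <= d:
        raise ValueError("rank r must satisfy 0 < r <= d")
    return 2 * d * r


def matmul(b: Matrix, a: Matrix) -> Matrix:
    """Plain matrix product, used here to form the update delta_W = B A."""
    inner = len(a)
    if any(len(row) != inner for row in b):
        raise ValueError("inner dimensions of B and A do not match")
    cols = len(a[0])
    return [
        [sum(b_row[k] * a[k][j] for k in range(inner)) for j in range(cols)]
        for b_row in b
    ]


# Hypothetical sizes, chosen only to make the ratio concrete.
d, r = 4096, 8
full = full_update_params(d)
lora = lora_update_params(d, r)
print(full, lora, full // lora)

# A rank-1 update on a 2 x 2 weight: B is 2 x 1, A is 1 x 2.
B = [[1.0], [2.0]]
A = [[3.0, 4.0]]
delta_w = matmul(B, A)
print(delta_w)
```

With these assumed sizes the low-rank update trains 65,536 parameters per matrix instead of 16,777,216, a 256-fold reduction. The savings grow as $r$ shrinks relative to $d$.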
Abstract
Compound Expression Recognition (CER) is crucial for understanding human emotions and improving human-computer interaction. However, CER faces challenges due to the complexity of facial expressions and the difficulty of capturing subtle emotional cues. To address these issues, we propose a novel approach leveraging Large Vision-Language Models (LVLMs). Our method employs a two-stage fine-tuning process: first, pre-trained LVLMs are fine-tuned on basic facial expressions to establish foundational patterns; second, the model is further optimized on a compound-expression dataset to refine visual-language feature interactions. Our approach achieves high accuracy on the RAF-DB dataset and demonstrates strong zero-shot generalization on the C-EXPR-DB dataset, showcasing its potential for real-world applications in emotion analysis and human-computer interaction.
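The two-stage schedule can be sketched as a small pipeline in which the second stage starts from whatever the first stage produced. This is a minimal sketch of the control flow only, not the authors' training code: the `Stage` fields, the dataset identifiers, the epoch counts, and `dummy_train` are all hypothetical stand-ins, and a real run would replace `dummy_train` with actual LVLM fine-tuning.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Adapter state is modelled as a plain dict; a real run would hold adapter weights.
AdapterState = Dict[str, Any]


@dataclass(frozen=True)
class Stage:
    name: str
    dataset: str  # hypothetical dataset identifier
    epochs: int   # hypothetical value, not from the paper


TrainFn = Callable[[AdapterState, Stage], AdapterState]


def run_stages(initial: AdapterState, stages: List[Stage], train_fn: TrainFn) -> AdapterState:
    """Run stages in order, feeding each stage the state left by the previous one."""
    state = initial
    for stage in stages:
        state = train_fn(state, stage)
    return state


def dummy_train(state: AdapterState, stage: Stage) -> AdapterState:
    """Placeholder for real fine-tuning: only records what would have been trained."""
    history = list(state.get("history", [])) + [stage.name]
    steps = state.get("steps", 0) + stage.epochs
    return {"history": history, "steps": steps}


STAGES = [
    Stage(name="basic", dataset="basic_expressions", epochs=2),
    Stage(name="compound", dataset="compound_expressions", epochs=3),
]

final_state = run_stages({}, STAGES, dummy_train)
print(final_state)
```

Passing the state through `run_stages` makes the ordering explicit: the compound-expression stage never starts from the raw pre-trained model, only from the result of the basic-expression stage.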
