PE-CLIP: A Parameter-Efficient Fine-Tuning of Vision Language Models for Dynamic Facial Expression Recognition
Ibtissam Saadi, Abdenour Hadid, Douglas W. Cunningham, Abdelmalik Taleb-Ahmed, Yassin El Hillali
TL;DR
PE-CLIP tackles the high cost of adapting vision-language models for dynamic facial expression recognition (DFER) with a parameter-efficient framework that freezes the CLIP encoders and trains only lightweight adapters (a Temporal Dynamic Adapter, TDA, and a Shared Adapter, ShA) together with Multi-modal Prompt Learning (MaPLe) using action unit (AU)-based textual prompts. The approach achieves competitive weighted and unweighted average recall (WAR and UAR) on DFEW and FERV39K with only about 9 million trainable parameters, which the authors attribute to temporal modeling via the GRU-based TDA with dynamic scaling and to cross-modal alignment through MaPLe. Key contributions include the TDA for short- and long-term temporal dependencies, ShA refinement applied in both the vision and text encoders, AU-informed textual prompts, and comprehensive ablations demonstrating the value of each component. This design offers a practical, resource-efficient path to robust DFER in real-world settings, with potential extensions to self-supervised learning and audio-visual fusion.
Abstract
Vision-Language Models (VLMs) like CLIP offer promising solutions for Dynamic Facial Expression Recognition (DFER) but face challenges such as inefficient full fine-tuning, high complexity, and poor alignment between textual and visual representations. Existing methods also struggle to model temporal information effectively. To address these issues, we propose PE-CLIP, a parameter-efficient fine-tuning (PEFT) framework that adapts CLIP for DFER, significantly reducing the number of trainable parameters while maintaining high accuracy. PE-CLIP introduces two specialized adapters: a Temporal Dynamic Adapter (TDA) and a Shared Adapter (ShA). The TDA is a GRU-based module with dynamic scaling that captures sequential dependencies while emphasizing informative temporal features and suppressing irrelevant variations. The ShA is a lightweight adapter that refines representations within both the textual and visual encoders, ensuring consistency and efficiency. We also integrate Multi-modal Prompt Learning (MaPLe), introducing learnable prompts for visual and action unit-based textual inputs, which enhances semantic alignment between modalities and enables efficient CLIP adaptation for dynamic tasks. We evaluate PE-CLIP on two benchmark datasets, DFEW and FERV39K, achieving competitive performance compared to state-of-the-art methods while requiring fewer trainable parameters. By balancing efficiency and accuracy, PE-CLIP sets a new benchmark in resource-efficient DFER. The source code of the proposed PE-CLIP will be publicly available at https://github.com/Ibtissam-SAADI/PE-CLIP .
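To give a rough sense of why adapter-based tuning keeps the trainable parameter count small, the sketch below counts the parameters of a generic bottleneck adapter and of a GRU-based temporal adapter. It is only a back-of-the-envelope illustration: the layer layouts, the scalar scaling head, the function names, and the widths (768 and 64) are assumptions made for this example. They are not taken from the abstract and are not expected to reproduce the reported total of about 9 million trainable parameters.

```python
# Back-of-the-envelope parameter counts for lightweight adapters.
# Every layout and size below is an illustrative assumption,
# not the PE-CLIP configuration.


def linear_params(d_in: int, d_out: int) -> int:
    """Weights plus bias of one fully connected layer."""
    return d_in * d_out + d_out


def bottleneck_adapter_params(d_model: int, d_bottleneck: int) -> int:
    """Generic bottleneck adapter: down-projection, then up-projection."""
    return linear_params(d_model, d_bottleneck) + linear_params(d_bottleneck, d_model)


def gru_params(d_in: int, d_hidden: int) -> int:
    """Single-layer GRU: three gates, each with input weights,
    recurrent weights, and one bias vector. Some implementations keep
    two bias vectors per gate, which adds 3 * d_hidden parameters."""
    return 3 * (d_in * d_hidden + d_hidden * d_hidden + d_hidden)


def temporal_adapter_params(d_model: int, d_hidden: int) -> int:
    """Hypothetical temporal adapter: down-projection, a GRU over frames,
    a head that predicts one scaling value, and an up-projection."""
    return (
        linear_params(d_model, d_hidden)
        + gru_params(d_hidden, d_hidden)
        + linear_params(d_hidden, 1)
        + linear_params(d_hidden, d_model)
    )


if __name__ == "__main__":
    d_model, d_hidden = 768, 64  # assumed feature width and bottleneck width
    print("bottleneck adapter:", bottleneck_adapter_params(d_model, d_hidden))
    print("temporal adapter:  ", temporal_adapter_params(d_model, d_hidden))
```

Under these assumed sizes, each adapter adds on the order of 10^5 parameters per insertion point, so even a dozen of them stay far below the size of a frozen transformer backbone, which is the general intuition behind freezing the encoders and training only the added modules.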
