PE-CLIP: A Parameter-Efficient Fine-Tuning of Vision Language Models for Dynamic Facial Expression Recognition

Ibtissam Saadi, Abdenour Hadid, Douglas W. Cunningham, Abdelmalik Taleb-Ahmed, Yassin El Hillali

TL;DR

PE-CLIP tackles the high cost of adapting vision-language models for dynamic facial expression recognition by introducing a parameter-efficient framework that freezes CLIP encoders while deploying lightweight adapters (Temporal Dynamic Adapter and Shared Adapter) and Multi-modal Prompt Learning with AU-based textual prompts. The approach achieves competitive WAR and UAR on DFEW and FERV39K with only about 9 million trainable parameters, thanks to temporal modeling via a GRU-based TDA with dynamic scaling and cross-modal refinement through MaPLe. Key contributions include the TDA for short- and long-term temporal dependencies, cross-modal ShA refinement across vision/text, AU-informed textual prompts, and comprehensive ablations demonstrating the value of each component. This design offers a practical, resource-efficient path to robust DFER in real-world settings, with potential extensions to self-supervised learning and audio-visual fusion.
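The two metrics named above can be made concrete with a small example. WAR (weighted average recall) weights each class's recall by its number of samples, which makes it equal to overall accuracy. UAR (unweighted average recall) is the plain mean of per-class recalls, so rare expressions count as much as frequent ones. The sketch below is illustrative only: the helper name `war_uar` and the toy labels are assumptions, not taken from the paper or its code.

```python
from collections import Counter


def war_uar(y_true, y_pred):
    """Return (WAR, UAR) as fractions in [0, 1]."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    support = Counter(y_true)  # number of samples per ground-truth class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = {c: correct[c] / n for c, n in support.items()}
    # WAR: per-class recall weighted by class support (equals overall accuracy).
    war = sum(recalls[c] * support[c] for c in support) / len(y_true)
    # UAR: unweighted mean of per-class recalls.
    uar = sum(recalls.values()) / len(recalls)
    return war, uar


# Toy, imbalanced labels (hypothetical): the classifier over-predicts "happy".
y_true = ["happy", "happy", "happy", "happy", "sad", "sad", "fear", "fear"]
y_pred = ["happy", "happy", "happy", "happy", "sad", "happy", "happy", "happy"]
war, uar = war_uar(y_true, y_pred)
print(f"WAR={war:.3f} UAR={uar:.3f}")  # prints "WAR=0.625 UAR=0.500"
```

The gap between the two numbers shows why both are usually reported on imbalanced expression datasets: a model biased toward the majority class keeps a reasonable WAR while its UAR drops.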

Abstract

Vision-Language Models (VLMs) like CLIP offer promising solutions for Dynamic Facial Expression Recognition (DFER) but face challenges such as inefficient full fine-tuning, high complexity, and poor alignment between textual and visual representations. Additionally, existing methods struggle with ineffective temporal modeling. To address these issues, we propose PE-CLIP, a parameter-efficient fine-tuning (PEFT) framework that adapts CLIP for DFER, significantly reducing the number of trainable parameters while maintaining high accuracy. PE-CLIP introduces two specialized adapters: a Temporal Dynamic Adapter (TDA) and a Shared Adapter (ShA). The TDA is a GRU-based module with dynamic scaling that captures sequential dependencies while emphasizing informative temporal features and suppressing irrelevant variations. The ShA is a lightweight adapter that refines representations within both textual and visual encoders, ensuring consistency and efficiency. Additionally, we integrate Multi-modal Prompt Learning (MaPLe), introducing learnable prompts for visual and action unit-based textual inputs, enhancing semantic alignment between modalities and enabling efficient CLIP adaptation for dynamic tasks. We evaluate PE-CLIP on two benchmark datasets, DFEW and FERV39K, achieving competitive performance compared to state-of-the-art methods while requiring fewer trainable parameters. By balancing efficiency and accuracy, PE-CLIP sets a new benchmark in resource-efficient DFER. The source code of the proposed PE-CLIP will be publicly available at https://github.com/Ibtissam-SAADI/PE-CLIP.
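To give a feel for what a "GRU-based module with dynamic scaling" might look like, here is a minimal pure-Python sketch. It is not the authors' implementation. The abstract does not say how the scaling is computed, so the sketch assumes a sigmoid gate on the GRU hidden state that scales a residual update to each frame feature. The class name `TinyGRUAdapter`, the dimensions, and the random initialization are all hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class TinyGRUAdapter:
    """Hypothetical temporal adapter: GRU over frames, then a scaled residual."""

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        self.hidden = hidden
        # GRU weights for the update gate (z), reset gate (r), and candidate (n).
        self.W = {g: rand_matrix(hidden, dim, rng) for g in "zrn"}
        self.U = {g: rand_matrix(hidden, hidden, rng) for g in "zrn"}
        self.up = rand_matrix(dim, hidden, rng)   # projects hidden state back to dim
        self.gate = rand_matrix(1, hidden, rng)   # ASSUMED form of "dynamic scaling"

    def step(self, x, h):
        """One GRU step (biases omitted for brevity)."""
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.W["n"], x), matvec(self.U["n"], rh))]
        return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

    def __call__(self, frames):
        h = [0.0] * self.hidden
        out = []
        for x in frames:
            h = self.step(x, h)
            scale = sigmoid(matvec(self.gate, h)[0])  # per-frame scalar in (0, 1)
            delta = matvec(self.up, h)
            # Residual update: the frozen feature plus a scaled temporal correction.
            out.append([xi + scale * di for xi, di in zip(x, delta)])
        return out


# Five toy "frame features" of dimension 4 (hypothetical values).
frames = [[0.1 * t, 0.2, -0.1, 0.05 * t] for t in range(5)]
adapter = TinyGRUAdapter(dim=4, hidden=2)
adapted = adapter(frames)
print(len(adapted), len(adapted[0]))  # prints "5 4"
```

Because the hidden size is much smaller than the feature size, such a module adds few parameters, and the residual form leaves the frozen features intact when the scale is near zero. Whether PE-CLIP uses this exact gating is not stated in the abstract.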

Paper Structure

This paper contains 14 sections, 15 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Performance comparison of dynamic facial expression recognition on the DFEW dataset. This chart highlights the trade-off between the number of tunable parameters (in millions) and model accuracy (measured as weighted average recall, WAR). The bubble size represents the model size, indicating the total number of trainable parameters. Our proposed framework, PE-CLIP (highlighted in red), achieves superior performance with significantly fewer tunable parameters ($<6\%$ of the whole model's parameters) compared to state-of-the-art methods. Models included in the comparison are IAL [lee2023frame], CEFLNet [liu2022clip], EC-STFL [jiang2020dfew], Former-DFER [zhao2021former], EST [liu2023expression], DFER-CLIP [zhao2023prompting], and CLIPER [li2024cliper].
  • Figure 2: Overall architecture of the proposed PE-CLIP model. The model takes as inputs tokenized AU-based textual descriptions $\mathbf{\textit{C}}_{t}$ and embedded image sequences $\mathbf{\textit{X}}_{t}$, both enriched with MaPLe learnable tokens. These representations are processed through the CLIP encoders, where the Shared Adapter (ShA) improves representation learning across the textual and visual modalities. Additionally, the Temporal Dynamic Adapter (TDA) with dynamic scaling captures the key temporal dependencies in the vision branch, while the Textual Adapter (TA) enhances the textual representations. The resulting visual ($\mathbf{\textit{f}}_{v}$) and textual ($\mathbf{\textit{f}}_{t}$) embeddings are mapped into CLIP’s shared space, where classification is performed via cosine similarity to associate expressions with their corresponding labels.
  • Figure 3: Attention visualization of our proposed PE-CLIP model. The figure presents attention maps for sadness and anger (left to right), showing how focus evolves across different model configurations. Each section includes the original frames (first row), the baseline model (second row), the model with ShA adapters (third row), and the full PE-CLIP with ShA and TDA (fourth row). Warmer colors indicate stronger attention. Across configurations, attention concentrates progressively on key facial regions, reflecting the representation refinement contributed by ShA and the temporal modeling contributed by TDA.
  • Figure 4: t-SNE visualization of high-level feature distributions on DFEW (fd1–5): the baseline model, i.e. PE-CLIP without adapters and prompts (top row), vs. the proposed PE-CLIP model (bottom row).
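
The classification step described in the Figure 2 caption, matching a visual embedding $\mathbf{\textit{f}}_{v}$ against per-class textual embeddings $\mathbf{\textit{f}}_{t}$ by cosine similarity, can be sketched as follows. The function names, the 3-dimensional toy vectors, and the class labels are illustrative assumptions. Real CLIP embeddings have hundreds of dimensions, and CLIP additionally multiplies the similarities by a learned temperature before a softmax, which does not change which class scores highest.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def classify_by_cosine(video_embedding, text_embeddings):
    """Return (best_label, {label: cosine similarity})."""
    f_v = l2_normalize(video_embedding)
    scores = {}
    for label, f_t in text_embeddings.items():
        f_t = l2_normalize(f_t)
        # For unit-length vectors, the dot product equals the cosine similarity.
        scores[label] = sum(a * b for a, b in zip(f_v, f_t))
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical text embeddings, one per expression class.
text_embeddings = {
    "happiness": [0.9, 0.1, 0.0],
    "sadness": [0.0, 1.0, 0.2],
    "anger": [0.1, 0.0, 1.0],
}
video_embedding = [0.1, 0.8, 0.3]  # hypothetical pooled video feature

label, scores = classify_by_cosine(video_embedding, text_embeddings)
print(label)  # prints "sadness"
```

Because the label set enters only through the text embeddings, changing the textual descriptions (for example, to AU-based prompts) changes the classifier without touching the visual branch.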