
CLIP-AUTT: Test-Time Personalization with Action Unit Prompting for Fine-Grained Video Emotion Recognition

Muhammad Osama Zeeshan, Masoumeh Sharafi, Benoît Savary, Alessandro Lameiras Koerich, Marco Pedersoli, Eric Granger

Abstract

Personalization in emotion recognition (ER) is essential for an accurate interpretation of subtle and subject-specific expressive patterns. Recent advances in vision-language models (VLMs) such as CLIP demonstrate strong potential for leveraging joint image-text representations in ER. However, CLIP-based methods either depend on CLIP's contrastive pretraining or on LLMs to generate descriptive text prompts, which are noisy, computationally expensive, and fail to capture fine-grained expressions, leading to degraded performance. In this work, we leverage Action Units (AUs) as structured textual prompts within CLIP to model fine-grained facial expressions. AUs encode the subtle muscle activations underlying expressions, providing localized and interpretable semantic cues for more robust ER. We introduce CLIP-AU, a lightweight AU-guided temporal learning method that integrates interpretable AU semantics into CLIP. It learns generic, subject-agnostic representations by aligning AU prompts with facial dynamics, enabling fine-grained ER without CLIP fine-tuning or LLM-generated text supervision. Although CLIP-AU models fine-grained AU semantics, it does not adapt to subject-specific variability in subtle expressions. To address this limitation, we propose CLIP-AUTT, a video-based test-time personalization method that dynamically adapts AU prompts to videos from unseen subjects. By combining entropy-guided temporal window selection with prompt tuning, CLIP-AUTT enables subject-specific adaptation while preserving temporal consistency. Our extensive experiments on three challenging video-based subtle ER datasets, BioVid, StressID, and BAH, indicate that CLIP-AU and CLIP-AUTT outperform state-of-the-art CLIP-based FER and TTA methods, achieving robust and personalized subtle ER. Our code is publicly available at: https://github.com/osamazeeshan/CLIP-AUTT.
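The abstract describes using Action Unit (AU) descriptions as structured text prompts inside CLIP. As a rough illustration of that idea only (a minimal sketch, not the authors' implementation), the snippet below encodes a few hypothetical AU prompts with the public OpenAI CLIP model and scores a single face frame against them; the prompt wording, the ViT-B/32 variant, and the frame path are all assumptions.

```python
# Sketch (assumption, not the released CLIP-AU code): scoring a face frame
# against Action Unit text prompts with the public OpenAI CLIP model.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical AU prompt templates; the paper's exact wording may differ.
au_prompts = [
    "a face with inner brow raiser (AU1)",
    "a face with brow lowerer (AU4)",
    "a face with cheek raiser (AU6)",
    "a face with lid tightener (AU7)",
    "a face with lip corner puller (AU12)",
]

image = preprocess(Image.open("frame.jpg")).unsqueeze(0).to(device)  # hypothetical path
text = clip.tokenize(au_prompts).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    sims = (img_feat @ txt_feat.T).squeeze(0)  # cosine similarity per AU prompt

for prompt, score in zip(au_prompts, sims.tolist()):
    print(f"{score:.3f}  {prompt}")
```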

Paper Structure

This paper contains 13 sections, 13 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: (a) Existing Video ER methods based on CLIP fine-tuning use class prompts or generate LLM prompts that remain fixed at inference time, providing no subject-specific adaptation. (b) Our CLIP-AU introduces AU-guided pretraining to learn fine-grained expression cues. (c) Our CLIP-AUTT extends this for test-time personalization, combining entropy-guided temporal window selection with AU prompt tuning to adapt CLIP to each target subject. (d) Illustration of throughput (videos/sec), trainable parameters (M), and GFLOPs against WAR. Lower GFLOPs indicate better efficiency.
  • Figure 2: Overview of CLIP-AU. The model aligns AU text embeddings with temporally encoded video representations to capture fine-grained facial dynamics for ER.
  • Figure 3: Overview of CLIP-AUTT. Given a target video, CLIP-AUTT applies a sliding window with a temporal module to capture temporally coherent AU patterns and estimates window-level entropy from AU similarity to select the most expressive segment. The selected window is then used for AU prompt tuning, adapting AU embeddings to the target subject and resulting in an improved, subject-specific personalized model (a minimal sketch of the window-selection step follows this list).
  • Figure 4: Left: Temporal window selection analysis. Performance comparison across different window lengths ($L$). Middle: Comparison between AU prompts and generic class prompts. Right: Qualitative comparison of predicted top-activated AUs and class predictions for CLIP-AU and CLIP-AUTT, alongside AU activations estimated by an external detector (OpenFace). CLIP-AUTT shows stronger agreement with the externally estimated AUs and better alignment with the underlying expression.
  • Figure 5: Representative samples from BioVid, StressID, and BAH highlight increased subject variability, facial movements (e.g., speaking or head motion), and environmental factors that can obscure subtle expression cues.
  • ...and 1 more figure
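Figure 3 describes an entropy-guided sliding window over the target video. The sketch below illustrates one plausible reading of that step, assuming the most expressive window is the one whose averaged AU-prompt similarities yield the lowest softmax entropy; the window length, mean aggregation, and minimum-entropy criterion are assumptions rather than details confirmed by the paper.

```python
# Sketch (assumption, not the released CLIP-AUTT code) of entropy-guided
# temporal window selection: slide a fixed-length window over per-frame
# AU-prompt logits, average them, and keep the window with lowest entropy.
import torch
import torch.nn.functional as F

def select_window(frame_logits: torch.Tensor, window_len: int) -> tuple[int, int]:
    """frame_logits: (T, C) per-frame similarity logits over AU prompts."""
    T = frame_logits.shape[0]
    best_start, best_entropy = 0, float("inf")
    for start in range(0, T - window_len + 1):
        window = frame_logits[start:start + window_len].mean(dim=0)  # (C,)
        probs = F.softmax(window, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum().item()
        if entropy < best_entropy:
            best_entropy, best_start = entropy, start
    return best_start, best_start + window_len

# Usage on dummy logits for a 32-frame clip scored against 5 AU prompts:
logits = torch.randn(32, 5)
start, end = select_window(logits, window_len=8)
print(f"selected frames [{start}, {end})")
```

The selected window would then serve as the input for AU prompt tuning on the target subject, as described in the Figure 3 caption.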