A$^{3}$lign-DFER: Pioneering Comprehensive Dynamic Affective Alignment for Dynamic Facial Expression Recognition with CLIP

Zeng Tao, Yan Wang, Junxiong Lin, Haoran Wang, Xinji Mai, Jiawen Yu, Xuan Tong, Ziheng Zhou, Shaoqi Yan, Qing Zhao, Liyuan Han, Wenqiang Zhang

TL;DR

The paper targets the gap in CLIP-based dynamic facial expression recognition (DFER) arising from abstract labels and temporal video dynamics. It introduces A$^{3}$lign-DFER, which combines Multi-Dimensional Alignment Tokens (MAT) of shape $Cls\times Snt\times Tkn\times Embd$, a Joint Dynamic Alignment Synchronizer (JAS), and a Bidirectional Alignment Paradigm (BAP) to achieve affective, dynamic, and bidirectional alignment while keeping CLIP encoders frozen. Empirical results on DFEW, FERV39k, and MAFW show state-of-the-art WAR and UAR, with ablations confirming the significant contributions of MAT and JAS; the approach leverages CLIP priors to enhance dynamic affective alignment in DFER. The work points toward future zero-shot DFER by expanding affective labels and pursuing full alignment, strengthening CLIP's applicability to dynamic human-centric tasks.
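The summary reports WAR and UAR without defining them. As commonly used in the DFER literature, WAR (weighted average recall) weights each class's recall by its sample count and therefore equals overall accuracy, while UAR (unweighted average recall) is the plain mean of per-class recalls. The sketch below illustrates the difference on made-up labels; the function name `war_uar` and the toy data are assumptions for illustration, not taken from the paper.

```python
from collections import Counter


def war_uar(y_true, y_pred):
    """Return (WAR, UAR) for two equal-length label sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    support = Counter(y_true)  # number of samples per ground-truth class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = {c: hits[c] / n for c, n in support.items()}
    # WAR: per-class recall weighted by class frequency (equals accuracy).
    war = sum(recalls[c] * n for c, n in support.items()) / len(y_true)
    # UAR: unweighted mean of per-class recalls, so rare classes count equally.
    uar = sum(recalls.values()) / len(recalls)
    return war, uar


# Toy, imbalanced example (hypothetical labels): 8 "happy" clips, 2 "fear" clips.
y_true = ["happy"] * 8 + ["fear"] * 2
y_pred = ["happy"] * 8 + ["happy", "fear"]
war, uar = war_uar(y_true, y_pred)
print(f"WAR={war:.3f} UAR={uar:.3f}")  # prints "WAR=0.900 UAR=0.750"
```

Because DFER datasets tend to be class-imbalanced, the two numbers can diverge noticeably, which is why results are usually reported with both.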

Abstract

CLIP does not perform as well on dynamic facial expression recognition (DFER) as it does on other CLIP-based classification tasks. CLIP's primary objective is to align images and text in the feature space, but in DFER the label text is abstract and the video is dynamic, so label representations are limited and perfect alignment is difficult. To address this issue, we design A$^{3}$lign-DFER, which introduces a new DFER labeling paradigm that achieves comprehensive alignment and makes CLIP better suited to the DFER task. Specifically, A$^{3}$lign-DFER comprises multiple modules that work together to obtain the most suitable expanded-dimensional embeddings for classification and to achieve alignment in three key aspects: affective, dynamic, and bidirectional. We replace the input label text with a learnable Multi-Dimensional Alignment Token (MAT), which aligns text with facial expression video samples in both the affective and dynamic dimensions. After CLIP feature extraction, we introduce the Joint Dynamic Alignment Synchronizer (JAS), which further facilitates synchronization and alignment in the temporal dimension. Additionally, we implement a Bidirectional Alignment Training Paradigm (BAP) to ensure gradual and steady training of the parameters for both modalities. Our concise A$^{3}$lign-DFER method achieves state-of-the-art results on multiple DFER datasets, including DFEW, FERV39k, and MAFW. Extensive ablation experiments and visualization studies demonstrate its effectiveness. The code will be made available in the future.
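To make the idea of expanded-dimensional label embeddings more concrete, here is a minimal pure-Python sketch, under stated assumptions, of a learnable token tensor of shape $Cls\times Snt\times Tkn\times Embd$ being scored against per-frame video features. It is not the authors' implementation. Reading the axes as classes, sentences, tokens, and embedding size is an interpretation of the notation. The mean-pooling `encode_sentence` is only a stand-in for the frozen CLIP text encoder, and the uniform averaging in `class_scores` is a placeholder for JAS, whose internals are not described in this summary. All sizes and names are hypothetical.

```python
import math
import random

random.seed(0)

# Hypothetical sizes, chosen only to keep the example small.
CLS, SNT, TKN, EMBD = 7, 2, 4, 8
FRAMES = 5


def rand_vec(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]


# Stand-in for the learnable MAT: for each class, SNT "sentences" of TKN token
# embeddings (nested lists indexed as mat[cls][snt][tkn][embd]).
mat = [[[rand_vec(EMBD) for _ in range(TKN)] for _ in range(SNT)]
       for _ in range(CLS)]


def encode_sentence(tokens):
    """ASSUMPTION: mean pooling replaces the frozen CLIP text encoder."""
    return [sum(col) / len(tokens) for col in zip(*tokens)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def class_scores(frame_feats, mat):
    """ASSUMPTION: uniform mean over frame-sentence pairs stands in for JAS."""
    scores = []
    for class_block in mat:
        sentences = [encode_sentence(s) for s in class_block]
        sims = [cosine(f, s) for f in frame_feats for s in sentences]
        scores.append(sum(sims) / len(sims))
    return scores


# Random vectors stand in for frozen CLIP image features of each video frame.
frame_feats = [rand_vec(EMBD) for _ in range(FRAMES)]
scores = class_scores(frame_feats, mat)
predicted = max(range(CLS), key=scores.__getitem__)
print(f"{len(scores)} class scores, predicted class index {predicted}")
```

The point of the sketch is the shape bookkeeping: each class contributes several sentence-level embeddings rather than a single label embedding, and every one of them is compared with every frame before the scores are reduced to one value per class.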

Paper Structure (17 sections, 6 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: The comparison between the basic paradigm of dynamic facial expression recognition and our A$^{3}$lign-DFER. (a) represents traditional DFER methods [wang2023rethinking, wang2022dpcnet, li2023intensity], training discriminative models to obtain classification results. (b) represents the CLIP method [radford2021learning], classifying by comparing the similarity in the feature space between videos and label text encoding. (c) represents the CLIP-based prompt learning method [zhou2022learning, zhou2022conditional, li2023cliper, zhao2023prompting], using learnable embeddings instead of fixed label embeddings. (d) represents our proposed A$^{3}$lign-DFER, achieving the best match between expression videos and classes through expanded-dimensional learning and comprehensive dynamic affective alignment.
  • Figure 2: A$^{3}$lign-DFER Method Workflow Overview. (a) Illustrates the A$^{3}$lign-DFER process, processing facial expressions and MAT through CLIP encoders, then into JAS for class-specific video features and classification. (b) Details MAT's structure and dimension meanings. (c) Shows JAS module construction. (d) Highlights the Bidirectional Alignment Training Paradigm (BAP) in our method.
  • Figure 3: Ablation study of the hyper-parameters of MAT.
  • Figure 4: t-SNE visualizations on the test data of the DFEW, FERV39k, and MAFW datasets. In each visualization, the left image displays the t-SNE results of the A$^{3}$lign-DFER process, the upper right image details the t-SNE results for video sample classes, and the lower right image details the t-SNE results of the text flow.