Text-guided Weakly Supervised Framework for Dynamic Facial Expression Recognition

Gunho Jung, Heejo Kong, Seong-Whan Lee

TL;DR

This work tackles dynamic facial expression recognition under weak supervision, where only clip-level labels are available. It proposes TG-DFER, a text-guided MIL framework that combines a vision-language pre-trained model with learnable text prompts and a novel visual prompt to refine frame-level emotion cues, coupled with a multi-grained temporal network that captures both short-term and long-range dynamics. The approach yields state-of-the-art results on DFEW and FERV39k, with ablations demonstrating the benefits of semantic guidance, cross-modal alignment, and multi-scale temporal reasoning. These contributions improve generalization and interpretability for in-the-wild DFER and point to a promising direction for cross-modal, weakly supervised affective recognition systems.

Abstract

Dynamic facial expression recognition (DFER) aims to identify emotional states by modeling the temporal changes in facial movements across video sequences. A key challenge in DFER is the many-to-one labeling problem, where a video composed of numerous frames is assigned a single emotion label. A common strategy to mitigate this issue is to formulate DFER as a Multiple Instance Learning (MIL) problem. However, MIL-based approaches inherently suffer from the visual diversity of emotional expressions and the complexity of temporal dynamics. To address these challenges, we propose TG-DFER, a text-guided weakly supervised framework that enhances MIL-based DFER by incorporating semantic guidance and coherent temporal modeling. A vision-language pre-trained (VLP) model is integrated to provide semantic guidance through fine-grained textual descriptions of emotional context. Furthermore, we introduce visual prompts, which align enriched textual emotion labels with visual instance features, enabling fine-grained reasoning and frame-level relevance estimation. In addition, a multi-grained temporal network is designed to jointly capture short-term facial dynamics and long-range emotional flow, ensuring coherent affective understanding across time. Extensive results demonstrate that TG-DFER achieves improved generalization, interpretability, and temporal sensitivity under weak supervision.
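To make the MIL formulation concrete, the following is a minimal sketch of how text embeddings could guide frame-level relevance when only a clip-level label is available. It is not the TG-DFER implementation: the function names (`cosine`, `softmax`, `text_guided_mil`), the max-over-classes relevance rule, and the `temperature` value are all assumptions made for illustration, and hand-written vectors stand in for the outputs of the VLP visual and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def softmax(xs, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def text_guided_mil(frame_feats, text_feats, temperature=0.1):
    """Hypothetical text-guided MIL pooling (illustrative assumption only).

    frame_feats: list of T frame (instance) feature vectors.
    text_feats:  dict mapping an emotion label to its text feature vector.
    Returns (relevance, clip_scores): one weight per frame, one score per label.
    """
    labels = list(text_feats)
    # Similarity of every frame to every emotion description.
    sims = [[cosine(f, text_feats[c]) for c in labels] for f in frame_feats]
    # Assumed relevance rule: a frame is relevant if it matches some
    # emotion description well; ambiguous frames receive low weight.
    relevance = softmax([max(row) for row in sims], temperature)
    # Clip-level (bag-level) score: relevance-weighted sum of frame scores.
    clip_scores = {
        c: sum(w * row[i] for w, row in zip(relevance, sims))
        for i, c in enumerate(labels)
    }
    return relevance, clip_scores

if __name__ == "__main__":
    # Toy stand-ins for encoder outputs (not real features).
    frames = [[1.0, 0.0], [0.7, 0.7], [0.9, 0.1]]
    texts = {"happy": [1.0, 0.0], "sad": [0.0, 1.0]}
    relevance, clip_scores = text_guided_mil(frames, texts)
    print("frame relevance:", [round(w, 3) for w in relevance])
    print("predicted label:", max(clip_scores, key=clip_scores.get))
```

In this toy setup the ambiguous middle frame receives the smallest weight, so the clip-level prediction is driven by the frames that align most clearly with one emotion description. A trained system would learn the prompts and the relevance estimator rather than fix them by hand.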

Paper Structure

This paper contains 28 sections, 12 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Overview of the proposed text-guided weakly supervised framework for DFER. The weakly supervised paradigm operates under coarse clip-level supervision. In contrast, our method incorporates text-guided alignment by injecting semantic priors via pre-trained visual and text encoders.
  • Figure 2: Illustration of the proposed TG-DFER framework. (a) The overall architecture integrates visual and textual modalities within a weakly supervised learning paradigm. (b) The multi-grained temporal network captures detailed frame-wise information and long-term dynamics, ensuring temporal coherence and spatial discrimination. (c) The visual prompt module constructs enhanced fine-grained label features by enriching text labels with visual context and performing instance-level alignment.
  • Figure 3: Confusion matrices of the proposed TG-DFER evaluated on the five folds of DFEW (a)-(e) and on FERV39k (f).
  • Figure 4: Visualization of the influence of the enhanced fine-grained label features on the DFEW dataset.