Text-guided Weakly Supervised Framework for Dynamic Facial Expression Recognition
Gunho Jung, Heejo Kong, Seong-Whan Lee
TL;DR
This work tackles dynamic facial expression recognition under weak supervision, where only clip-level labels are available. It proposes TG-DFER, a text-guided MIL framework that fuses a vision-language pre-trained model with learnable text prompts and a novel visual prompt to refine frame-level emotion cues, coupled with a multi-grained temporal network that captures both short-term and long-range dynamics. The approach yields state-of-the-art results on DFEW and FERV39k, with clear ablations demonstrating the benefits of semantic guidance, cross-modal alignment, and multi-scale temporal reasoning. These contributions enhance generalization and interpretability in in-the-wild DFER, suggesting a robust direction for future cross-modal, weakly supervised affective recognition systems.
Abstract
Dynamic facial expression recognition (DFER) aims to identify emotional states by modeling the temporal changes in facial movements across video sequences. A key challenge in DFER is the many-to-one labeling problem, where a video composed of numerous frames is assigned a single emotion label. A common strategy to mitigate this issue is to formulate DFER as a Multiple Instance Learning (MIL) problem. However, MIL-based approaches inherently suffer from the visual diversity of emotional expressions and the complexity of temporal dynamics. To address these challenges, we propose TG-DFER, a text-guided weakly supervised framework that enhances MIL-based DFER by incorporating semantic guidance and coherent temporal modeling. We integrate a vision-language pre-trained (VLP) model to provide semantic guidance through fine-grained textual descriptions of emotional context. Furthermore, we introduce visual prompts, which align enriched textual emotion labels with visual instance features, enabling fine-grained reasoning and frame-level relevance estimation. In addition, a multi-grained temporal network is designed to jointly capture short-term facial dynamics and long-range emotional flow, ensuring coherent affective understanding across time. Extensive experiments demonstrate that TG-DFER achieves improved generalization, interpretability, and temporal sensitivity under weak supervision.
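The MIL view with text guidance can be illustrated with a small sketch. It is not the TG-DFER implementation: the function names (`cosine`, `softmax`, `text_guided_mil_pool`), the temperature value, and the toy 2-D vectors are all hypothetical, and the pooling rule is a generic attention-style MIL aggregation chosen for illustration. Each frame (instance) is scored against per-emotion text embeddings, the scores are turned into frame-level relevance weights, and the weights pool the frame scores into one clip-level (bag) score per emotion.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def softmax(xs, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def text_guided_mil_pool(frame_feats, text_embeds, temperature=0.1):
    """Pool frame-level scores into clip-level scores (illustrative only).

    frame_feats: one feature vector per frame (the MIL instances).
    text_embeds: one embedding per emotion class (assumed to come from a
                 text encoder; here they are just toy vectors).
    Returns (clip_scores, relevance), where relevance[c][t] is the weight
    of frame t for class c.
    """
    # sims[t][c]: similarity of frame t to the text embedding of class c.
    sims = [[cosine(f, t) for t in text_embeds] for f in frame_feats]
    clip_scores, relevance = [], []
    for c in range(len(text_embeds)):
        per_frame = [row[c] for row in sims]
        weights = softmax(per_frame, temperature)  # frame-level relevance
        relevance.append(weights)
        clip_scores.append(sum(w * s for w, s in zip(weights, per_frame)))
    return clip_scores, relevance

# Toy clip: two frames resemble class 0, one frame resembles class 1.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
scores, relevance = text_guided_mil_pool(frames, texts)
```

In this toy clip, the relevance weights for class 0 concentrate on the first two frames, and those for class 1 on the last frame, which is the kind of frame-level relevance estimate the clip-level label alone does not provide.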
