Soften the Mask: Adaptive Temporal Soft Mask for Efficient Dynamic Facial Expression Recognition

Mengzhu Li, Quanxing Zha, Hongjun Wu

TL;DR

This work tackles inefficiency in dynamic facial expression recognition (DFER) with AdaTosk, a two-branch framework that pairs a self-supervised reconstruction branch with a supervised classification branch and applies an adaptive temporal soft mask to suppress redundant temporal information. The mask has two components: a class-agnostic dynamic soft mask that highlights key expression moments, and a class-semantic similar soft mask that reduces semantic redundancy among similar tokens over time. The two are combined into a final temporal soft mask. On DFEW, FERV39K, and MAFW, AdaTosk achieves performance competitive with or better than state-of-the-art methods while using fewer parameters and FLOPs. This makes DFER more practical in real-world, resource-constrained settings.

Abstract

Dynamic Facial Expression Recognition (DFER) facilitates the understanding of psychological intentions through non-verbal communication. Existing methods struggle to manage irrelevant information, such as background noise and redundant semantics, which hurts both efficiency and effectiveness. In this work, we propose a novel supervised temporal soft masked autoencoder network for DFER, namely AdaTosk, which integrates a parallel supervised classification branch with the self-supervised reconstruction branch. The self-supervised reconstruction branch applies a random binary hard mask to generate diverse training samples, encouraging meaningful feature representations in visible tokens. Meanwhile, the classification branch employs an adaptive temporal soft mask to flexibly mask visible tokens based on their temporal significance. Its two key components, the class-agnostic and class-semantic soft masks, respectively enhance critical expression moments and reduce semantic redundancy over time. Extensive experiments on widely used benchmarks demonstrate that AdaTosk substantially reduces computational costs compared with current state-of-the-art methods while maintaining competitive performance.

Paper Structure

This paper contains 11 sections, 13 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Redundancy in visible tokens. Row 1: All original frames. Row 2: Frames after random masking. Row 3: Gray-masked regions indicate redundant temporal information in visible tokens. Yellow lines connect all distinct frames/tokens. (i) Key moments denote the important frames; (ii) Similarity accumulation indicates the redundancy of similar tokens.
  • Figure 2: Overall framework of AdaTosk. The left part illustrates the overall model architecture with two parallel branches. After the hard mask, the visible tokens are processed by the encoder. The decoder then reconstructs the masked tokens under the reconstruction loss $\mathcal{L}_{rec}$, while a temporal soft mask is applied for expression recognition under the classification loss $\mathcal{L}_{cls}$. The right part represents the adaptive temporal soft mask mechanism: (i) The class-agnostic dynamic soft mask identifies key frames; (ii) The class-semantic similar soft mask transfers the accumulative temporal score between adjacent frames; (iii) The final temporal soft mask combines the results from (i) and (ii).
  • Figure 3: Visualizations of class activation maps and the two temporal soft masks. Row 1: Original frames. Row 2: Class-semantic activation maps. Row 3: Visible patches after hard masking. Row 4: Class-semantic (CS) similar soft mask on visible patches. Row 5: Class-agnostic (CA) dynamic soft mask on visible patches. Essential dynamic regions are highlighted in red and orange. Colors from blue to purple to gray/black represent the soft mask degree, from light to heavy.
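
The three-step soft mask mechanism described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. This is not the paper's formulation: the paper's equations are not reproduced on this page, so every formula below is an assumption chosen for illustration. In particular, the function name `temporal_soft_mask`, the mixing weight `alpha`, the use of frame-to-frame feature change as the dynamic score, and the use of plain cosine similarity as a stand-in for the class-semantic similarity are all hypothetical.

```python
import numpy as np


def temporal_soft_mask(tokens, alpha=0.5, eps=1e-8):
    """Illustrative per-frame soft weights in [0, 1] (hypothetical formulas).

    tokens: array of shape (T, N, D) holding N visible-token features
            of dimension D for each of T frames.
    """
    tokens = np.asarray(tokens, dtype=np.float64)
    num_frames = tokens.shape[0]
    if num_frames == 1:
        return np.ones(1)

    # Pool the visible tokens of each frame into one feature vector.
    frame_feat = tokens.mean(axis=1)  # (T, D)

    # (i) Dynamic score (assumption): how much a frame changed from its
    # predecessor. Frame 0 has no predecessor, so it reuses the first step.
    step = np.linalg.norm(np.diff(frame_feat, axis=0), axis=1)  # (T-1,)
    dynamic = np.concatenate([step[:1], step])
    dynamic = dynamic / (dynamic.max() + eps)

    # (ii) Similarity score (assumption): cosine similarity of adjacent
    # frames, accumulated along a run of similar frames so that later
    # frames in the run count as more redundant.
    norms = np.linalg.norm(frame_feat, axis=1)
    cos = (frame_feat[1:] * frame_feat[:-1]).sum(axis=1)
    cos = np.clip(cos / (norms[1:] * norms[:-1] + eps), 0.0, 1.0)
    accumulated = np.zeros(num_frames)
    for t in range(1, num_frames):
        accumulated[t] = cos[t - 1] * (accumulated[t - 1] + 1.0)
    similar = 1.0 / (1.0 + accumulated)

    # (iii) Final soft mask: convex combination of the two scores.
    return alpha * dynamic + (1.0 - alpha) * similar


# Toy input: three identical frames followed by one distinct frame.
tokens = np.zeros((4, 3, 2))
tokens[:3, :, 0] = 1.0
tokens[3, :, 1] = 1.0
weights = temporal_soft_mask(tokens)
```

On this toy input, the weights shrink across the run of identical frames and peak at the final, distinct frame, which mirrors the intent of suppressing temporal redundancy while emphasizing key moments. A soft mask of this kind scales tokens rather than discarding them, in contrast to the binary hard mask used by the reconstruction branch.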