Soften the Mask: Adaptive Temporal Soft Mask for Efficient Dynamic Facial Expression Recognition

Mengzhu Li; Quanxing Zha; Hongjun Wu

TL;DR

This work tackles inefficiency in dynamic facial expression recognition (DFER) by introducing AdaTosk, a two-branch framework that combines self-supervised reconstruction with supervised classification and uses an adaptive temporal soft mask to suppress redundant temporal information. The mask has two components: a class-agnostic dynamic mask that highlights dynamic moments, and a class-semantic similar mask that preserves semantically important tokens across time. Together they yield a final mask for efficient token selection. On DFEW, FERV39K, and MAFW, AdaTosk performs competitively with or better than state-of-the-art methods while reducing parameters and FLOPs. This makes DFER more practical in real-world, resource-constrained settings, because computation is focused on the informative moments and temporal redundancy is reduced.
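To make the two-component mask concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the summary above does not give formulas, so every scoring rule here is an assumption chosen for illustration. The dynamic score is taken as the feature change between consecutive frames, the semantic score as cosine similarity to a hypothetical class vector `class_vec`, and the final mask as their element-wise product. All function names are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def min_max(values):
    """Rescale scores to [0, 1]; if all scores are equal, keep everything."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def dynamic_mask(frames):
    """Class-agnostic score (assumed): how much each frame differs from the previous one."""
    diffs = [l2_distance(frames[t], frames[t - 1]) for t in range(1, len(frames))]
    # The first frame has no predecessor, so it reuses the first measured change.
    diffs = [diffs[0]] + diffs if diffs else [0.0]
    return min_max(diffs)


def semantic_mask(frames, class_vec):
    """Class-semantic score (assumed): similarity of each frame to a class vector."""
    return min_max([cosine(f, class_vec) for f in frames])


def final_soft_mask(frames, class_vec):
    """Combine both scores; the product is an assumed fusion rule."""
    dyn = dynamic_mask(frames)
    sem = semantic_mask(frames, class_vec)
    return [d * s for d, s in zip(dyn, sem)]


def apply_soft_mask(frames, mask):
    """Scale each frame's features by its weight instead of dropping the frame."""
    return [[m * x for x in f] for f, m in zip(frames, mask)]


# Toy clip: four frames with 2-D features; the expression changes at frame 2.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
class_vec = [0.0, 1.0]
mask = final_soft_mask(frames, class_vec)
weighted = apply_soft_mask(frames, mask)
```

In this toy clip only frame 2 is both dynamic (it differs from frame 1) and semantically aligned with `class_vec`, so it receives the highest weight. A soft mask of this kind rescales tokens rather than discarding them outright, which is the contrast with a binary hard mask.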

Abstract

Dynamic Facial Expression Recognition (DFER) facilitates the understanding of psychological intentions through non-verbal communication. Existing methods struggle to manage irrelevant information, such as background noise and redundant semantics, which hurts both efficiency and effectiveness. In this work, we propose AdaTosk, a novel supervised temporal soft masked autoencoder network for DFER that integrates a parallel supervised classification branch with a self-supervised reconstruction branch. The reconstruction branch applies a random binary hard mask to generate diverse training samples, encouraging meaningful feature representations in the visible tokens. Meanwhile, the classification branch employs an adaptive temporal soft mask to flexibly mask visible tokens based on their temporal significance. Its two key components, the class-agnostic and class-semantic soft masks, respectively enhance critical expression moments and reduce semantic redundancy over time. Extensive experiments on widely used benchmarks demonstrate that AdaTosk substantially reduces computational cost compared with current state-of-the-art methods while maintaining competitive performance.
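The random binary hard mask used by the reconstruction branch can be illustrated with a short sketch. This is an assumption-laden example rather than the paper's code: the abstract does not state a masking ratio, so `mask_ratio` and its value below are hypothetical, as are the function names and the convention that 1 marks a visible token and 0 a masked one.

```python
import random


def random_hard_mask(num_tokens, mask_ratio, seed=None):
    """Return a binary mask: 1 = visible token, 0 = masked token.

    mask_ratio is a hypothetical parameter; the abstract gives no value for it.
    """
    rng = random.Random(seed)
    num_masked = int(round(num_tokens * mask_ratio))
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [0 if i in masked else 1 for i in range(num_tokens)]


def visible_tokens(tokens, mask):
    """Keep only the tokens whose mask entry is 1."""
    return [tok for tok, keep in zip(tokens, mask) if keep == 1]


# Toy example: 10 placeholder tokens, 70% of them hidden.
tokens = ["tok%d" % i for i in range(10)]
hard_mask = random_hard_mask(len(tokens), mask_ratio=0.7, seed=0)
kept = visible_tokens(tokens, hard_mask)
```

Each call with a different seed hides a different subset, which is how a hard mask yields diverse training samples from one clip. The soft mask in the classification branch then operates only on the tokens that remain visible.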
