Point-Supervised Facial Expression Spotting with Gaussian-Based Instance-Adaptive Intensity Modeling

Yicheng Deng, Hideaki Hayashi, Hajime Nagahara

TL;DR

This work tackles point-supervised facial expression spotting (P-FES), where training relies on a single timestamp per expression instance rather than full temporal boundaries. The authors propose a decoupled two-branch framework: a regression-based, class-agnostic expression intensity branch that uses Gaussian-based instance-adaptive intensity modeling (GIM) for soft pseudo-labeling, and a class-aware apex classification branch that distinguishes macro-expressions (MaEs) from micro-expressions (MEs) using pseudo-apex frames. An intensity-aware contrastive (IAC) loss further improves discriminative feature learning and suppresses neutral noise by contrasting neutral frames with expression frames of various intensities. Experiments on SAMM-LV, CAS(ME)$^2$, and CAS(ME)$^3$ show strong performance gains over state-of-the-art methods, supporting the approach's effectiveness and its potential to reduce annotation costs for practical FES deployment.

Abstract

Automatic facial expression spotting, which aims to identify facial expression instances in untrimmed videos, is crucial for facial expression analysis. Existing methods primarily focus on fully-supervised learning and rely on costly, time-consuming temporal boundary annotations. In this paper, we investigate point-supervised facial expression spotting (P-FES), where only a single timestamp annotation per instance is required for training. We propose a unique two-branch framework for P-FES. First, to mitigate the limitation of hard pseudo-labeling, which often confuses neutral and expression frames with various intensities, we propose a Gaussian-based instance-adaptive intensity modeling (GIM) module to model instance-level expression intensity distribution for soft pseudo-labeling. By detecting the pseudo-apex frame around each point label, estimating the duration, and constructing an instance-level Gaussian distribution, GIM assigns soft pseudo-labels to expression frames for more reliable intensity supervision. The GIM module is incorporated into our framework to optimize the class-agnostic expression intensity branch. Second, we design a class-aware apex classification branch that distinguishes macro- and micro-expressions solely based on their pseudo-apex frames. During inference, the two branches work independently: the class-agnostic expression intensity branch generates expression proposals, while the class-aware apex-classification branch is responsible for macro- and micro-expression classification. Furthermore, we introduce an intensity-aware contrastive loss to enhance discriminative feature learning and suppress neutral noise by contrasting neutral frames with expression frames with various intensities. Extensive experiments on the SAMM-LV, CAS(ME)$^2$, and CAS(ME)$^3$ datasets demonstrate the effectiveness of our proposed framework.
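The three GIM steps named in the abstract (detect a pseudo-apex frame near the point label, estimate the duration, build an instance-level Gaussian) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `gaussian_soft_labels`, the search window, the relative-threshold rule for estimating duration, and the rule tying the Gaussian width to the duration are all hypothetical choices made for illustration, since the abstract does not specify them.

```python
import math


def gaussian_soft_labels(scores, point_idx, search_radius=2,
                         rel_threshold=0.5, sigma_ratio=1.0):
    """Illustrative soft pseudo-labeling for one expression instance.

    scores     : per-frame predicted intensity scores (list of floats)
    point_idx  : index of the annotated point label
    All other parameters are assumptions, not values from the paper.
    """
    n = len(scores)

    # Step 1 (assumed rule): the pseudo-apex is the highest-scoring frame
    # inside a small window around the point label.
    lo = max(0, point_idx - search_radius)
    hi = min(n - 1, point_idx + search_radius)
    apex = max(range(lo, hi + 1), key=lambda t: scores[t])

    # Step 2 (assumed rule): grow the instance outward from the apex while
    # neighboring scores stay above a fraction of the apex score.
    cutoff = rel_threshold * scores[apex]
    left = apex
    while left > 0 and scores[left - 1] >= cutoff:
        left -= 1
    right = apex
    while right < n - 1 and scores[right + 1] >= cutoff:
        right += 1

    # Step 3 (assumed rule): center a Gaussian on the apex and scale its
    # width with the estimated half-duration, so longer instances get
    # broader soft labels.
    half_duration = max(apex - left, right - apex, 1)
    sigma = sigma_ratio * half_duration

    # Frames outside the estimated duration keep a label of 0 (neutral).
    labels = [0.0] * n
    for t in range(left, right + 1):
        labels[t] = math.exp(-((t - apex) ** 2) / (2.0 * sigma ** 2))
    return apex, (left, right), labels


# Toy example: the point label (frame 3) is slightly off the true peak.
toy_scores = [0.05, 0.1, 0.4, 0.7, 0.9, 0.6, 0.3, 0.1, 0.05]
apex, bounds, labels = gaussian_soft_labels(toy_scores, point_idx=3)
```

In this toy case the pseudo-apex moves from the annotated frame to the nearby peak, and the frames around it receive labels that decay smoothly from 1 rather than a hard 0/1 split, which is the property soft pseudo-labeling is meant to provide.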

Paper Structure

This paper contains 31 sections, 19 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Comparison of different forms of supervision. The fully-supervised method requires annotating the onset, apex, and offset frames for each instance, whereas the point-supervised method only requires annotating a single frame for each instance.
  • Figure 2: Motivation illustration of the re-designed overall framework. (a) General two-branch frameworks fuse the action scores with the expression intensity scores during both training and inference. However, since MEs typically exhibit lower intensity than MaEs, the fused ME scores are often suppressed, making MEs harder to spot. (b) In our re-designed framework, both branches are optimized independently, preventing MEs from being overshadowed during inference.
  • Figure 3: Motivation illustration of the soft pseudo-labeling. Because expression frames vary in intensity, hard pseudo-labeling cannot describe this characteristic well. We use soft pseudo-labeling to learn the intensity distribution of each instance, reducing the ambiguity in distinguishing between neutral and expression frames with various intensities.
  • Figure 4: Overview of the proposed framework. The framework initially calculates the optical flow and extracts snippet features by SpotFormer. These features are fed into a two-branch framework to obtain class-agnostic expression intensity scores (right) and class-aware apex scores (left). A GIM module is employed to build the Gaussian distribution for each expression instance and assign soft pseudo-labels to model the intensity distribution. An IAC module is employed to build contrasts among pseudo-labeled frames with various intensities to enhance feature learning and suppress neutral noise.
  • Figure 5: Pseudo-label results of four expression instances. The line graph with blue dots represents the soft pseudo-labels assigned by our model; the leftmost and rightmost blue dots indicate the estimated expression duration for pseudo-labeling, while the peak dot indicates the pseudo-apex frame.
  • ...and 2 more figures