Weak Supervision with Arbitrary Single Frame for Micro- and Macro-expression Spotting

Wang-Wang Yu, Xian-Shi Zhang, Fu-Ya Luo, Yijun Cao, Kai-Fu Yang, Hong-Mei Yan, Yong-Jie Li

TL;DR

A point-level weakly-supervised expression spotting (PWES) framework in which each expression needs to be annotated with only one random frame (i.e., a point). Multi-refined pseudo label generation (MPLG) and distribution-guided feature contrastive learning (DFCL) address the inaccurate boundaries and fragmentary predictions that pseudo-label mining introduces.

Abstract

Frame-level micro- and macro-expression spotting methods require time-consuming frame-by-frame observation during annotation. Meanwhile, video-level spotting lacks sufficient information about the location and number of expressions during training, resulting in significantly inferior performance compared with fully-supervised spotting. To bridge this gap, we propose a point-level weakly-supervised expression spotting (PWES) framework, where each expression needs to be annotated with only one random frame (i.e., a point). The prevailing way to mitigate the resulting sparse label distribution is pseudo-label mining, which, however, introduces new problems: localizing contextual background snippets results in inaccurate boundaries, and discarding foreground snippets leads to fragmentary predictions. Therefore, we design two strategies, multi-refined pseudo label generation (MPLG) and distribution-guided feature contrastive learning (DFCL), to address these problems. Specifically, MPLG generates more reliable pseudo labels by merging class-specific probabilities, attention scores, fused features, and point-level labels. DFCL enhances feature similarity within the same category and feature variability across different categories while capturing global representations across the entire dataset. Extensive experiments on the CAS(ME)^2, CAS(ME)^3, and SAMM-LV datasets demonstrate that PWES achieves promising performance comparable to that of recent fully-supervised methods.
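To make the point-level annotation concrete, the following Python sketch draws one random frame from each ground-truth interval and maps it to the snippet that contains it. It is only an illustration under stated assumptions: the function name `sample_point_labels`, the interval values, and the snippet length of 8 frames are hypothetical and are not taken from the paper. Only the frame count of 2273 comes from the Figure 1 caption.

```python
import math
import random


def sample_point_labels(intervals, snippet_len, num_snippets, seed=None):
    """Pick one random frame per ground-truth interval (a "point").

    intervals: list of (onset_frame, offset_frame, label) with 1-indexed,
               inclusive frame numbers (hypothetical format).
    Returns (points, labels), where points is a list of
    (frame, snippet_index, label) and labels is a per-snippet list that is
    None everywhere except at the annotated snippets.
    """
    rng = random.Random(seed)
    labels = [None] * num_snippets
    points = []
    for onset, offset, label in intervals:
        frame = rng.randint(onset, offset)  # inclusive on both ends
        # Map the 1-indexed frame to a 0-indexed, non-overlapping snippet.
        snippet = min((frame - 1) // snippet_len, num_snippets - 1)
        labels[snippet] = label
        points.append((frame, snippet, label))
    return points, labels


if __name__ == "__main__":
    # Hypothetical intervals; the values are made up for the example.
    example_intervals = [(100, 110, "ME"), (500, 560, "MaE")]
    snippet_len = 8  # assumed snippet length in frames
    num_snippets = math.ceil(2273 / snippet_len)
    points, labels = sample_point_labels(
        example_intervals, snippet_len, num_snippets, seed=0
    )
    for frame, snippet, label in points:
        print(f"{label}: frame {frame} -> snippet {snippet}")
```

All snippets other than the sampled ones stay unlabeled, which is the sparse label distribution that pseudo-label mining is meant to compensate for.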

Paper Structure (25 sections, 19 equations, 2 figures, 11 tables)

Figures (2)

  • Figure 1: A video containing frames #1 to #2273 from the CAS(ME)$^2$ dataset. The video contains three ground truth intervals, with the first interval containing a micro-expression (ME) and the last two containing macro-expressions (MaEs). We first preprocess the video by dividing it into uniform, non-overlapping snippets, each of which contains the same number of frames. A random frame is selected from each ground truth interval as one of the point-level labels to train our model. During training, we generate attention scores, which signify the probability that a snippet belongs to the foreground. During testing, we use these attention scores to generate proposals with different top-$k$ values [yu2023weaklysupervised]. The green and orange blocks represent valid and invalid proposal intervals, respectively. Our objective is to identify consecutive video snippets that closely match the ground truth intervals.
  • Figure 2: The overall architecture of PWES, which consists of four parts: (a) Feature Extraction and Embedding, which uses a two-stream Inflated 3D ConvNets (I3D) model [carreira2017quo] to extract raw image features $\mathcal{X}_r$ and optical flow features $\mathcal{X}_f$ from uniform non-overlapping snippets. To further extract representational features, we use a core saliency compensation module from MC-WES [yu2023weaklysupervised] as our embedding module. The extracted features are processed individually by the embedding module and then used to generate attention scores $\mathcal{A}_r$ for the raw image modality and attention scores $\mathcal{A}_f$ for the optical flow modality. The mean attention scores $\mathcal{A}$ indicate the probability that a snippet belongs to the foreground. In addition, the processed features are fused as $\mathcal{X}$; (b) Snippet Spotting, which processes temporal class activation maps (TCAMs) $\mathcal{S}$ and calculates mean class-specific probabilities with a temporal top-$k$ pooling layer. These mean probabilities are used to compute two multiple instance learning (MIL) losses [Dmaron1997framework], i.e., $\mathcal{L}_{mil}^1$ and $\mathcal{L}_{mil}^2$, with video-level labels $\mathcal{Y}^v$; (c) Snippet Mining, which generates pseudo labels $\widehat{\mathcal{Y}}$ with a multi-refined pseudo label generation (MPLG) algorithm by merging fused video features $\mathcal{X}$, probabilities from TCAMs $\mathcal{S}$, attention scores $\mathcal{A}$, and point-level labels $\mathcal{Y}$. The generated pseudo labels $\widehat{\mathcal{Y}}$ are combined with the point-level labels $\mathcal{Y}$ to calculate the snippet-level classification loss $\mathcal{L}_{fl}$; (d) Representation Learning, which implements the distribution-guided feature contrastive learning (DFCL) algorithm with a memory bank. We use a distribution-guided feature sampling (DFS) module to calculate region-level vectors from the fused video features $\mathcal{X}$ and the pseudo labels $\widehat{\mathcal{Y}}$. These region-level vectors are used to update the memory bank and calculate the contrastive learning loss $\mathcal{L}_{cl}$.
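
The temporal top-$k$ pooling in part (b) of Figure 2 and the grouping of consecutive snippets into proposals in Figure 1 can be illustrated with a short Python sketch. It is a simplified illustration, not the authors' implementation: the function names `topk_mean_pool` and `group_consecutive` are hypothetical, the scores are made up, and the fixed threshold in `group_consecutive` is a stand-in assumption for the paper's proposal generation with different top-$k$ values, whose exact procedure is not given here.

```python
def topk_mean_pool(tcam, k):
    """Temporal top-k pooling over a class activation map.

    tcam: list of T rows (one per snippet), each a list of C class scores.
    Returns a list of C values: for each class, the mean of its k largest
    scores over time.
    """
    if not tcam:
        raise ValueError("tcam must contain at least one snippet")
    k = max(1, min(k, len(tcam)))  # clamp k to the valid range [1, T]
    num_classes = len(tcam[0])
    pooled = []
    for c in range(num_classes):
        column = sorted((row[c] for row in tcam), reverse=True)
        pooled.append(sum(column[:k]) / k)
    return pooled


def group_consecutive(scores, threshold):
    """Group consecutive snippets whose score exceeds the threshold.

    Returns a list of (start, end) snippet indices, both inclusive.
    The threshold rule is an assumption made for this sketch.
    """
    proposals = []
    start = None
    for i, score in enumerate(scores):
        if score > threshold and start is None:
            start = i  # a new proposal begins
        elif score <= threshold and start is not None:
            proposals.append((start, i - 1))  # the proposal ended at i - 1
            start = None
    if start is not None:  # a proposal runs to the last snippet
        proposals.append((start, len(scores) - 1))
    return proposals


if __name__ == "__main__":
    # Made-up scores: 4 snippets, 2 classes.
    tcam = [[0.1, 0.9], [0.8, 0.2], [0.6, 0.4], [0.0, 0.0]]
    print(topk_mean_pool(tcam, k=2))

    # Made-up attention scores for 5 snippets.
    attention = [0.1, 0.7, 0.8, 0.2, 0.9]
    print(group_consecutive(attention, threshold=0.5))
```

Averaging only the top-$k$ scores per class lets a few confident snippets determine the video-level prediction, which suits videos in which expressions occupy a small fraction of the snippets.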