PESFormer: Boosting Macro- and Micro-expression Spotting with Direct Timestamp Encoding

Wang-Wang Yu, Kai-Fu Yang, Xiangrui Hu, Jingwen Jiang, Hong-Mei Yan, Yong-Jie Li

TL;DR

This paper introduces PESFormer, a simple yet effective model based on the vision transformer architecture for point-to-interval expression spotting. It replaces anchors with a direct timestamp encoding (DTE) approach, enabling binary classification of each timestamp instead of optimizing entire ground truths.

Abstract

The task of macro- and micro-expression spotting aims to precisely localize and categorize temporal expression instances within untrimmed videos. Given the sparse distribution and varying durations of expressions, existing anchor-based methods often represent instances by encoding their deviations from predefined anchors. Additionally, these methods typically slice the untrimmed videos into fixed-length sliding windows. However, anchor-based encoding often fails to capture all training intervals, and slicing the original video into sliding windows can result in valuable training intervals being discarded. To overcome these limitations, we introduce PESFormer, a simple yet effective model based on the vision transformer architecture to achieve point-to-interval expression spotting. PESFormer employs a direct timestamp encoding (DTE) approach to replace anchors, enabling binary classification of each timestamp instead of optimizing entire ground truths. Thus, all training intervals are retained in the form of discrete timestamps. To maximize the utilization of training intervals, we improve preprocessing by no longer using the short videos produced by the sliding window method. Instead, we zero-pad the untrimmed training videos to create uniform, longer videos of a predetermined duration. This operation efficiently preserves the original training intervals and eliminates video slice enhancement. Extensive qualitative and quantitative evaluations on three datasets, CAS(ME)^2, CAS(ME)^3 and SAMM-LV, demonstrate that PESFormer outperforms existing techniques.
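
The idea of direct timestamp encoding combined with zero-padding can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `encode_timestamps`, the parameters `stride` and `padded_len`, the validity mask, and the rule that a timestamp is foreground when its sampled frame falls inside a ground-truth interval are all assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive (onset_frame, offset_frame); assumed format


def encode_timestamps(
    intervals: List[Interval],
    num_timestamps: int,
    stride: int,
    padded_len: int,
) -> Tuple[List[int], List[int]]:
    """Encode ground-truth intervals as one binary label per timestamp.

    Timestamp t is mapped to frame t * stride (an assumed sampling rule).
    The sequence is zero-padded to `padded_len`; `mask` marks real
    timestamps (1) versus padding (0) so padding can be ignored in a loss.
    """
    labels = [0] * padded_len
    mask = [0] * padded_len
    for t in range(min(num_timestamps, padded_len)):
        frame = t * stride
        mask[t] = 1
        # Foreground if the sampled frame lies inside any training interval.
        if any(onset <= frame <= offset for onset, offset in intervals):
            labels[t] = 1
    return labels, mask


# Hypothetical video: 8 timestamps sampled every 4 frames, padded to length 10.
labels, mask = encode_timestamps(
    [(4, 9), (20, 22)], num_timestamps=8, stride=4, padded_len=10
)
print(labels)  # [0, 1, 1, 0, 0, 1, 0, 0, 0, 0]
print(mask)    # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

In this sketch every interval contributes at least the timestamps it covers, and no video is cut into windows, which mirrors the motivation given in the abstract; how the paper actually decides foreground membership may differ.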

Paper Structure

This paper contains 30 sections, 11 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: A long video from the CAS(ME)$^2$ dataset, spanning from frame #1 to frame #2273, undergoes a preprocessing stage where it is divided into a sequence of uniform timestamp snippets. During training, each timestamp snippet contributes to the generation of a foreground probability, which is subsequently utilized in computing the loss. During testing, these foreground probabilities serve as indicators to identify valid timestamp snippets. These valid timestamp snippets are then leveraged to generate proposals. Our objective is to spot a set of consecutive video intervals that closely align with the ground truths.
  • Figure 2: Overall schematic of PESFormer. Given a video, we extract video features $\mathcal{X}_{r}$ and optical flow features $\mathcal{X}_{f}$ using a two-stream Inflated 3D ConvNets (I3D) model (Carreira and Zisserman, 2017). During training, these features are derived from a set of uniformly sampled, overlapping snippets from the video and its corresponding optical flow. The input features $\mathcal{X}$ are formed by concatenating $\mathcal{X}_{r}$ and $\mathcal{X}_{f}$. Next, the input features $\mathcal{X}$ undergo $p_1$ convolution layers in the embedding component, resulting in embedded features $\mathcal{X}_e$. These embedded features $\mathcal{X}_e$ are then processed by $p_2$ transformer networks in the temporal encoding component, generating fine-grained features $\mathcal{X}_l$. To capture multiscale temporal information, we deploy $p_3$ downsampling transformer (DTransformer) networks, creating a feature pyramid network. The outputs of this pyramid are $[\widetilde{\mathcal{X}}_l^1, ..., \widetilde{\mathcal{X}}_l^{p_3}]$, which are subsequently used to produce snippet-level probabilities $\mathcal{O}$ to indicate the likelihood of each snippet belonging to the foreground. During testing, snippet-level probabilities $\mathcal{O}$ are employed to identify valid snippets, which are subsequently combined to form expression proposals.
  • Figure 3: Snippet extraction process. Given a video that comprises a total of $L$ frames, we split it into timestamp snippets, each consisting of $s$ consecutive frames. The overlap between neighboring timestamp snippets is $\delta$ frames.
  • Figure 4: Illustration of the two encoding methods: the proposed DTE and anchor-based encoding. The DTE method assesses whether each individual timestamp snippet qualifies as part of the foreground, embodying a localized approach. Conversely, the anchor-based encoding identifies valid anchors by computing the intersection over union (IoU) between pre-defined anchors and training intervals, subsequently selecting those anchors whose IoU exceeds a specified threshold. These valid anchors are utilized to encode training intervals in terms of center deviation $\Delta_c$ and duration deviation $\Delta_l$. Each anchor's corresponding set of deviations from the training intervals is then assigned a class label $y$ derived from the training intervals themselves.
  • Figure 5: Three examples illustrating two different preprocessing methods for dividing long videos into multiple snippets. The first method is the sliding window approach, which divides untrimmed videos of varying durations into a large collection of shorter videos of uniform length for training. In contrast, our large fixed duration method appends zeros (zero-padding) to untrimmed videos of different durations, unifying their durations before they are used in training.
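
The snippet extraction in Figure 3 and the test-time step in Figures 1 and 2, where valid snippets are combined into proposals, can be sketched as follows. This is a hedged illustration rather than the paper's procedure: the function names, the 0-based inclusive frame indices, the choice to drop trailing frames that do not fill a whole snippet, and the fixed probability threshold of 0.5 are all assumptions.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # inclusive (start_frame, end_frame), 0-based (assumed)


def extract_snippets(L: int, s: int, delta: int) -> List[Span]:
    """Split L frames into snippets of s frames, neighbors overlapping by delta."""
    if not 0 <= delta < s:
        raise ValueError("overlap delta must satisfy 0 <= delta < s")
    stride = s - delta
    # Trailing frames that cannot fill a whole snippet are dropped (assumption).
    return [(start, start + s - 1) for start in range(0, L - s + 1, stride)]


def proposals_from_probs(
    probs: List[float], spans: List[Span], threshold: float = 0.5
) -> List[Span]:
    """Merge runs of consecutive valid snippets into frame-level proposals.

    A snippet is treated as valid when its foreground probability is at
    least `threshold` (an assumed decision rule).
    """
    proposals: List[Span] = []
    run_start = None
    for i, p in enumerate(probs):
        if p >= threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            proposals.append((spans[run_start][0], spans[i - 1][1]))
            run_start = None
    if run_start is not None:
        # Close a run that extends to the final snippet.
        proposals.append((spans[run_start][0], spans[len(probs) - 1][1]))
    return proposals


spans = extract_snippets(L=10, s=4, delta=2)
print(spans)  # [(0, 3), (2, 5), (4, 7), (6, 9)]
print(proposals_from_probs([0.1, 0.8, 0.9, 0.2], spans))  # [(2, 7)]
```

With $L = 10$, $s = 4$ and $\delta = 2$, the stride is $s - \delta = 2$, giving four snippets. The two middle snippets exceed the threshold and merge into a single proposal spanning frames 2 to 7.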