Spatio-temporal Transformers for Action Unit Classification with Event Cameras

Luca Cultrera, Federico Becattini, Lorenzo Berlincioni, Claudio Ferrari, Alberto Del Bimbo

TL;DR

The paper proposes a spatiotemporal Vision Transformer that uses Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) to classify Action Units from event streams. By capturing both spatial and temporal information, the model outperforms baseline methods.

Abstract

Face analysis has been studied from different angles to infer emotion, poses, shapes, and landmarks. Traditionally RGB cameras are used, yet for fine-grained tasks standard sensors may fall short because of their latency, which makes it impossible to record and detect micro-movements. These movements carry a highly informative signal that is necessary for inferring the true emotions of a subject. Event cameras have been gaining interest as a possible solution to this and similar high-frame-rate tasks. We propose a novel spatiotemporal Vision Transformer model that uses Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) to enhance the accuracy of Action Unit classification from event streams. We also address the lack of labeled event data in the literature, which can be considered one of the main causes of the existing gap between the maturity of RGB and neuromorphic vision models. Gathering data is harder in the event domain, since it cannot be crawled from the web, and labeling frames must take into account event aggregation rates and the fact that static parts might not be visible in certain frames. To this end, we present FACEMORPHIC, a temporally synchronized multimodal face dataset composed of RGB videos and event streams. The dataset is annotated at the video level with facial Action Units and contains streams collected with various possible applications in mind, ranging from 3D shape estimation to lip-reading. We then show how temporal synchronization allows effective neuromorphic face analysis without the need to manually annotate videos: we instead leverage cross-modal supervision, bridging the domain gap by representing face shapes in a 3D space. Our proposed model outperforms baseline methods by effectively capturing spatial and temporal information, which is crucial for recognizing subtle facial micro-expressions.

Paper Structure

This paper contains 22 sections, 3 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: We leverage cross-modal supervision obtainable from temporally synchronized RGB and event streams to analyze faces using neuromorphic data. By extracting 3D face shape coefficients with standard RGB vision models, we can improve the training of event-based models without additional manual labeling.
  • Figure 2: Example of AU-3DMM components learned from the D3DFACS dataset. The heatmaps show the spatial extent of the deformation (red=high, blue=no deformation). The learned components capture AU-specific facial movements.
  • Figure 3: Architecture overview. Each video frame is first augmented using Shifted Patch Tokenization and divided into tokens. Tokens are linearly projected and fed to a spatial transformer with Locality Self-Attention, along with a spatial CLS token. This operation is performed in parallel for each frame. We retain only the CLS token output of each frame and feed these outputs to the temporal transformer, which captures temporal patterns. The final classification is produced by a feed-forward network applied to the output of the temporal CLS token.
  • Figure 4: Samples of Action Units being performed and the estimated 3D face shape. Left: RGB frame; Center: corresponding event frame; Right: reconstructed 3D mesh, where the most active parts are colored in red according to their distance from a neutral reference model.
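
The two spatial components named in the Figure 3 caption can be illustrated with a minimal pure-Python sketch. It assumes the common formulation of SPT and LSA from the small-dataset Vision Transformer literature: SPT stacks the frame with four diagonally shifted copies (shift of half a patch) before cutting patches, and LSA masks each token's attention to itself and divides the scores by a temperature that is normally learned. The function names (`shift`, `shifted_patch_tokens`, `lsa_weights`), the single-channel frame, and the zero padding are illustrative assumptions, not the authors' code; layer normalization, the linear projection, the CLS tokens, and the temporal transformer are left out.

```python
import math

def shift(img, dy, dx):
    """Shift a 2D image by (dy, dx) pixels, zero-padding the uncovered border."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = img[sy][sx]
    return out

def shifted_patch_tokens(img, patch):
    """Stack the image with four diagonal shifts, then cut non-overlapping patches.

    Returns one flat token per patch, of length 5 * patch * patch
    (the original view first, followed by the four shifted views).
    """
    s = patch // 2  # assumed shift of half a patch
    views = [img] + [shift(img, dy, dx) for dy in (-s, s) for dx in (-s, s)]
    h, w = len(img), len(img[0])
    tokens = []
    for py in range(0, h, patch):
        for px in range(0, w, patch):
            tokens.append([v[y][x] for v in views
                           for y in range(py, py + patch)
                           for x in range(px, px + patch)])
    return tokens

def lsa_weights(q, k, temperature):
    """Attention weights with diagonal masking and temperature scaling.

    Assumes at least two tokens; `temperature` would be a learned scalar
    in a real model and is a plain argument here.
    """
    n = len(q)
    weights = []
    for i in range(n):
        # Mask the self-token score so a token cannot attend to itself.
        scores = [-math.inf if j == i
                  else sum(a * b for a, b in zip(q[i], k[j])) / temperature
                  for j in range(n)]
        m = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(sc - m) for sc in scores]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights

# Toy 4x4 "event frame" split into 2x2 patches.
frame = [[float(4 * r + c + 1) for c in range(4)] for r in range(4)]
tokens = shifted_patch_tokens(frame, patch=2)
attn = lsa_weights(tokens, tokens, temperature=math.sqrt(len(tokens[0])))
```

In this toy run the raw tokens stand in for the queries and keys; in a full model they would first pass through the linear projections, and the per-frame CLS outputs would then form the input sequence of the temporal transformer.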