evTransFER: A Transfer Learning Framework for Event-based Facial Expression Recognition

Rodrigo Verschae, Ignacio Bugueno-Cordova

TL;DR

This work tackles event-based facial expression recognition (FER) by addressing data sparsity and temporal dynamics with a transfer-learning approach. It introduces three components: the Temporal Information of Events (TIE) representation, an encoder pretrained on facial frame reconstruction and transferred to the recognition network, and an LSTM-based temporal module that captures longer-term expression dynamics. On the synthetic e-CK+ and real NEFER datasets, evTransFER outperforms state-of-the-art event-based FER methods, reaching 93.6% accuracy on e-CK+ and up to 76.7% on NEFER, with notable gains from reconstruction-based pretraining and fine-tuning. The approach enables near real-time inference and suggests that reconstruction-informed encoders may apply to other datasets and to object recognition tasks in neuromorphic vision.

Abstract

Event-based cameras are bio-inspired sensors that asynchronously capture pixel intensity changes with microsecond latency, high temporal resolution, and high dynamic range, providing information on the spatiotemporal dynamics of a scene. We propose evTransFER, a transfer learning-based framework for facial expression recognition using event-based cameras. The main contribution is a feature extractor designed to encode facial spatiotemporal dynamics, built by training an adversarial generative method on facial reconstruction and transferring the encoder weights to the facial expression recognition system. We demonstrate that the proposed transfer learning method improves facial expression recognition compared to training a network from scratch. We propose an architecture that incorporates an LSTM to capture longer-term facial expression dynamics and introduces a new event-based representation called TIE. We evaluated the framework using both the synthetic event-based facial expression database e-CK+ and the real neuromorphic dataset NEFER. On e-CK+, evTransFER achieved a recognition rate of 93.6%, surpassing state-of-the-art methods. On NEFER, which comprises event sequences with real sensor noise and sparse activity, the proposed transfer learning strategy achieved an accuracy of up to 76.7%. On both datasets, the results surpassed those of existing methods and of models trained from scratch.
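The weight-transfer step described in the abstract can be illustrated with a framework-free sketch. It assumes that model parameters are stored as name-to-weights dictionaries and that encoder parameters share an `encoder.` name prefix. The function name, the prefix, the layer names, and the toy values are all hypothetical and are not taken from the paper.

```python
def transfer_encoder_weights(generator_state, classifier_state, prefix="encoder."):
    """Copy encoder parameters from a pretrained reconstruction generator
    into a recognition network (hypothetical dictionary-based layout).

    Only entries whose name starts with `prefix`, that exist in both models,
    and whose sizes match are copied. Everything else in the classifier
    (temporal module, classification head) keeps its own initial values.
    """
    merged = dict(classifier_state)
    copied = []
    for name, weights in generator_state.items():
        if not name.startswith(prefix):
            continue  # skip decoder and other generator-only parameters
        if name in merged and len(merged[name]) == len(weights):
            merged[name] = list(weights)
            copied.append(name)
    return merged, copied


# Toy parameters standing in for real tensors (illustrative values only).
pretrained_generator = {
    "encoder.conv1": [0.1, 0.2],
    "decoder.up1": [0.9],
}
fresh_classifier = {
    "encoder.conv1": [0.0, 0.0],
    "lstm.w": [0.5],
    "head.fc": [0.3],
}

merged_state, copied_names = transfer_encoder_weights(
    pretrained_generator, fresh_classifier
)
```

After the copy, the recognition network would be fine-tuned on the expression labels. Whether the transferred encoder is frozen or updated during that stage is a design choice this sketch does not model.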

Paper Structure

This paper contains 36 sections, 8 equations, 9 figures, 13 tables.

Figures (9)

  • Figure 1: Pipeline for event-based classification that addresses the need for (i) a representation that captures the events' spatial distribution and asynchronous temporal evolution, (ii) a feature extractor that properly encodes the input information, (iii) an analysis over a long sequence of events, and (iv) a classifier module.
  • Figure 2: Proposed architecture for event-based facial expression recognition, comprising (i) a new event representation called TIE (Temporal Information of Events), (ii) a feature extractor module for identifying and encoding events' key patterns (an encoder trained for face reconstruction; see the methodology section), (iii) temporal memory networks enabling temporal learning of event sequences, and (iv) a classification module that categorizes events into facial expressions.
  • Figure 3: Pipeline of the proposed representation: Temporal Information of Events (TIE). TIE is based on the Event Spike Tensor (EST) representation (Gehrig et al., 2019).
  • Figure 4: Variants of TIE event representation generated from the e-CK+ dataset by changing the temporal variable $\tau \in \{ \tau_k, \hat{\tau_k}\}$, parameter of measurement function $f$ and the kernel function $h$.
  • Figure 5: Proposed architecture for event-based facial frame reconstruction. The architecture involves a conditional Generative Adversarial Network that reconstructs frames from a latent variable. This reconstruction system comprises three components: (i) the TIE representation, which serves as a latent variable; (ii) the generator network, which is responsible for reconstructing facial frames from events; and (iii) the discriminator network, which assesses the likelihood of the generated frame against the original frame, providing feedback to the generator based on the loss incurred. The system accepts two inputs: an event sequence and the corresponding frames.
  • ...and 4 more figures
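
The captions for Figures 3 and 4 describe TIE as building on EST, with a measurement function $f$ and a kernel function $h$. The sketch below shows a generic EST-style grid, not TIE itself: each event contributes its measurement, weighted by a kernel over the distance between its normalized timestamp and each temporal bin centre. Using polarity as the measurement, a triangular kernel, and the function names shown are assumptions made for illustration.

```python
def triangular_kernel(dt, bin_width):
    """Linear kernel: 1 at dt = 0, falling to 0 at |dt| = bin_width."""
    return max(0.0, 1.0 - abs(dt) / bin_width)


def events_to_grid(events, width, height, num_bins):
    """Accumulate events (x, y, t, polarity) into grid[bin][y][x].

    Assumed choices (not from the paper): the measurement is the event
    polarity, and the temporal kernel is triangular over normalized time.
    """
    if num_bins < 2:
        raise ValueError("num_bins must be at least 2")
    grid = [[[0.0] * width for _ in range(height)] for _ in range(num_bins)]
    if not events:
        return grid

    t_first = min(e[2] for e in events)
    t_last = max(e[2] for e in events)
    span = (t_last - t_first) or 1.0  # avoid division by zero
    bin_width = 1.0 / (num_bins - 1)  # bin centres at 0, bin_width, ..., 1

    for x, y, t, polarity in events:
        tau = (t - t_first) / span  # normalized timestamp in [0, 1]
        for b in range(num_bins):
            weight = triangular_kernel(tau - b * bin_width, bin_width)
            if weight > 0.0:
                grid[b][y][x] += polarity * weight
    return grid


# Three positive events at the same pixel, at normalized times 0, 0.25 and 1.
demo_events = [(0, 0, 0.0, 1), (0, 0, 0.25, 1), (0, 0, 1.0, 1)]
grid = events_to_grid(demo_events, width=2, height=2, num_bins=3)
```

In this toy case the event at normalized time 0.25 is split equally between the first two bins, so sub-bin timing is preserved as fractional weights rather than discarded by hard binning.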