Coupling deep and handcrafted features to assess smile genuineness

Benedykt Pawlus, Bogdan Smolka, Jolanta Kawulok, Michal Kawulok

TL;DR

The paper tackles smile genuineness recognition from video by fusing handcrafted facial action unit (AU) dynamics (AUDA) with deep features from RealSmileNet in a late-fusion framework. It introduces frame-wise AUDA streams and phase-wise AU dynamics that are combined with deep CNN-LSTM representations, forming four parallel models whose outputs are concatenated for final classification. Empirical results on the UvA-NEMO dataset show that AUDA dynamics alone can outperform the deep features, and that their fusion yields the highest accuracy while remaining capable of real-time processing on standard GPUs. The work contributes interpretable AU-based cues, demonstrates the value of combining handcrafted dynamics with deep features for emotion-related tasks, and points to future directions in dynamics-focused network designs and super-resolution preprocessing.
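
The late-fusion step described above can be illustrated with a minimal sketch. It assumes each of the four parallel models emits a fixed-length feature vector, that the vectors are concatenated, and that a single dense sigmoid unit produces the final score. The feature sizes, the random weights, the 0.5 threshold, the mapping of high scores to "spontaneous", and the function name `late_fusion_score` are all hypothetical and are not taken from the paper.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def late_fusion_score(branch_features, weights, bias):
    """Concatenate per-branch feature vectors and apply one dense sigmoid unit.

    branch_features: list of feature vectors, one per parallel model.
    weights, bias: parameters of the (hypothetical) final dense layer.
    """
    fused = [value for features in branch_features for value in features]
    if len(fused) != len(weights):
        raise ValueError("weight vector must match the fused feature length")
    logit = sum(w * v for w, v in zip(weights, fused)) + bias
    return sigmoid(logit)


# Stand-ins for the outputs of four parallel models (sizes are assumptions).
rng = random.Random(0)
branches = [[rng.uniform(-1, 1) for _ in range(dim)] for dim in (4, 3, 3, 2)]
weights = [rng.uniform(-0.5, 0.5) for _ in range(12)]

score = late_fusion_score(branches, weights, bias=0.0)
# Assumed convention: scores of at least 0.5 are read as "spontaneous".
label = "spontaneous" if score >= 0.5 else "posed"
print(f"fused score = {score:.3f}, predicted label = {label}")
```

In a trained system the final-layer weights would be learned rather than drawn at random; the sketch only shows how concatenation lets one classifier see the handcrafted and deep features together.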

Abstract

Assessing smile genuineness from video sequences is a vital topic concerned with recognizing facial expressions and linking them with the underlying emotional states. A number of techniques have been proposed that are underpinned by handcrafted features, as well as techniques that rely on deep learning to learn useful features. As both of these approaches have certain benefits and limitations, in this work we propose to combine the features learned by a long short-term memory network with the features handcrafted to capture the dynamics of facial action units. The results of our experiments indicate that the proposed solution is more effective than the baseline techniques and that it allows for assessing smile genuineness from video sequences in real time.

Paper Structure

This paper contains 6 sections, 2 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Outline of the RealSmileNet architecture. Each input image, being a difference between two consecutive frames, is processed with a frame-wise branch to extract deep features that are fed to a corresponding LSTM cell (red color). The output features of the last LSTM cell enter the classification block composed of the final branch and a dense sigmoid layer that retrieves the final decision on smile genuineness.
  • Figure 2: Outline of the frame-wise AUDA features classification scheme. The features extracted from each video frame are fed to a corresponding LSTM cell (red color). The output features of the last LSTM cell enter the classification block composed of the final branch and a dense sigmoid layer.
  • Figure 3: Outline of the architecture that classifies the phase-wise AUDA features (both AU-wise and cross-AU ones). The feature vector is fed to the classification block composed of the final branch and a dense sigmoid layer.
  • Figure 4: Examples of selected video frames from the UvA-NEMO dataset (for subjects no. 1 and no. 400) showing posed smiles (two upper rows) and spontaneous smiles (two lower rows).
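
The Figure 1 caption states that each network input is the difference between two consecutive frames. A minimal sketch of that preprocessing step is given below, assuming grayscale frames stored as nested lists of equal size. The subtraction order (next minus previous) and the function name `frame_differences` are assumptions made for illustration, not details confirmed by the paper.

```python
def frame_differences(frames):
    """Return pixel-wise differences between consecutive frames.

    frames: list of 2D frames (lists of rows) with identical shapes.
    A sequence of N frames yields N - 1 difference images.
    """
    return [
        [
            [nxt_px - prev_px for prev_px, nxt_px in zip(prev_row, nxt_row)]
            for prev_row, nxt_row in zip(prev, nxt)
        ]
        for prev, nxt in zip(frames, frames[1:])
    ]


# Three tiny 2x2 "frames" standing in for video frames.
frames = [
    [[0, 0], [0, 0]],
    [[1, 2], [3, 4]],
    [[1, 3], [5, 7]],
]

diffs = frame_differences(frames)
print(len(diffs))  # 2 difference images from 3 frames
print(diffs[1])    # [[0, 1], [2, 3]]
```

Each difference image would then pass through the frame-wise branch and its LSTM cell, as the caption outlines; those stages are omitted here.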