Deep Learning-Based Real-Time Sequential Facial Expression Analysis Using Geometric Features

Talha Enes Koksal, Abdurrahman Gumus

TL;DR

The paper tackles real-time sequential macro-expression recognition by leveraging geometric features derived from MediaPipe FaceMesh landmarks and a ConvLSTM1D–MLP classifier. By computing frame-to-frame Euclidean distances and angles between landmark pairs and processing 5-frame sequences, the method captures the onset, apex, and offset phases of expressions. Experiments on CK+, Oulu-CASIA (VIS/NIR), and MMI show competitive accuracies, with strong real-time performance (~165 fps on an RTX 3060) and good generalization in composite dataset tests. The work provides an open-source, fast, and adaptable framework for real-time emotion-aware applications and sets the stage for further improvements in robustness across varying conditions.

Abstract

Facial expression recognition is a crucial component in enhancing human-computer interaction and developing emotion-aware systems. Real-time detection and interpretation of facial expressions have become increasingly important for various applications, from user experience personalization to intelligent surveillance systems. This study presents a novel approach to real-time sequential facial expression recognition using deep learning and geometric features. The proposed method utilizes MediaPipe FaceMesh for rapid and accurate facial landmark detection. Geometric features, including Euclidean distances and angles, are extracted from these landmarks. Temporal dynamics are incorporated by analyzing feature differences between consecutive frames, enabling the detection of onset, apex, and offset phases of expressions. For classification, a ConvLSTM1D network followed by multilayer perceptron blocks is employed. The method's performance was evaluated on multiple publicly available datasets, including CK+, Oulu-CASIA (VIS and NIR), and MMI. Accuracies of 93%, 79%, 77%, and 68% were achieved respectively. Experiments with composite datasets were also conducted to assess the model's generalization capabilities. The approach demonstrated real-time applicability, processing approximately 165 frames per second on consumer-grade hardware. This research contributes to the field of facial expression analysis by providing a fast, accurate, and adaptable solution. The findings highlight the potential for further advancements in emotion-aware technologies and personalized user experiences, paving the way for more sophisticated human-computer interaction systems. To facilitate further research in this field, the complete source code for this study has been made publicly available on GitHub: https://github.com/miralab-ai/facial-expression-analysis.
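
The abstract describes the geometric features only at a high level, so the following is a minimal sketch rather than the authors' implementation. It assumes 2D landmark coordinates and interprets the features as the Euclidean distance and angle of each landmark's displacement between consecutive frames, with the two groups concatenated into one vector per frame pair. The names `frame_pair_features` and `sequence_features` are hypothetical and are not taken from the paper or its repository.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # assumed (x, y) landmark coordinates


def frame_pair_features(prev: Sequence[Point], curr: Sequence[Point]) -> List[float]:
    """Distance and angle of each landmark's displacement between two frames."""
    if len(prev) != len(curr):
        raise ValueError("both frames must have the same number of landmarks")
    distances: List[float] = []
    angles: List[float] = []
    for (x0, y0), (x1, y1) in zip(prev, curr):
        dx, dy = x1 - x0, y1 - y0
        distances.append(math.hypot(dx, dy))  # Euclidean distance moved
        angles.append(math.atan2(dy, dx))     # direction of movement, in radians
    # Concatenate the two feature groups into a single vector.
    return distances + angles


def sequence_features(frames: Sequence[Sequence[Point]]) -> List[List[float]]:
    """Turn N frames of landmarks into N-1 feature vectors (one per frame pair)."""
    return [frame_pair_features(a, b) for a, b in zip(frames, frames[1:])]
```

Under these assumptions, a 5-frame sequence yields 4 feature vectors, each with twice as many entries as there are landmarks. Such an array would then be scaled and passed to a sequence classifier.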

Paper Structure

This paper contains 12 sections, 11 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Collage of images from the datasets used in the macro-expression experiments. (a) Cohn-Kanade (CK+) dataset. (b) Oulu-CASIA dataset, Near Infrared (NIR) variation. (c) Oulu-CASIA dataset, Visible Light (VIS) variation. (d) MMI dataset; only subjects that have sequential data are used.
  • Figure 2: Feature creation algorithm flow. Sequential camera frames are processed by the MediaPipe FaceMesh algorithm to extract landmark coordinates. The facial landmarks of the selected frames are used to create features by calculating the Euclidean distance and angle between the landmarks of the current and previous frames. These two features are concatenated to create the feature vector. Images of the subject are taken from the CK+ dataset.
  • Figure 3: Classification algorithm for the macro-expression experiments. First, the $(N-1, A)$-shaped array that holds the feature vectors for the whole sequence is scaled using a standard scaler. The scaled 1D data is converted to an image format that is fed into the ConvLSTM1D block. The output of the ConvLSTM1D block is flattened and classified using multi-layer perceptron layers. Here, $e$ is the emotion count and $a$ is the feature count.
  • Figure 4: Diagram illustrating the internal structure of a ConvLSTM cell. The figure depicts the convolutional operations applied to the input and hidden states, showcasing the processes of input modulation, forget gate, input gate, and output gate. These components work together to capture spatiotemporal features in sequential data.
  • Figure 5: Accuracy vs. epoch graphs for the (a) CK+ and (b) Oulu-CASIA datasets, trained using 61 landmarks with AU grouping. The shaded area shows the min-max range across the folds of 5-fold cross-validation, while the bold lines show their average.
  • ...and 5 more figures