ST-Gait++: Leveraging spatio-temporal convolutions for gait-based emotion recognition on videos

Maria Luísa Lima; Willams de Lima Costa; Estefania Talavera Martinez; Veronica Teichrieb

TL;DR

The paper addresses emotion recognition from gait, arguing that gait provides informative nonverbal cues beyond facial expressions. It proposes ST-Gait++, a skeleton-based spatio-temporal graph convolutional network operating on a 16-joint 3D skeleton with three ST-GCN++ blocks to classify four emotions. On the E-Gait dataset, it achieves approximately $87.5\%$ accuracy, about $5.4$ percentage points above the STEP baseline, and converges roughly $3.63\times$ faster in training. The work also discusses limitations of the dataset and bias/diversity considerations, highlighting practical implications for accessible gait-based emotion analysis and directions for more diverse open datasets.

Abstract

Emotion recognition is relevant for human behaviour understanding, where facial expression and speech recognition have been widely explored by the computer vision community. Literature in the field of behavioural psychology indicates that gait, described as the way a person walks, is an additional indicator of emotions. In this work, we propose a deep framework for emotion recognition through the analysis of gait. More specifically, our model is composed of a sequence of spatial-temporal Graph Convolutional Networks that produce a robust skeleton-based representation for the task of emotion classification. We evaluate our proposed framework on the E-Gait dataset, composed of a total of 2177 samples. The results obtained represent an improvement of approximately 5% in accuracy compared to the state of the art. In addition, during training we observed a faster convergence of our model compared to the state-of-the-art methodologies.

Paper Structure (15 sections, 1 equation, 5 figures, 5 tables)

This paper contains 15 sections, 1 equation, 5 figures, 5 tables.

Introduction
Related Works
Method
(a) Skeletal trajectory extraction.
(b) Skeletal trajectory classification.
Experimental setup
Dataset
Validation metric
Implementation details
Results and discussion
Quantitative analysis.
Qualitative analysis.
Limitations of the E-Gait dataset.
Diversity and Bias in emotion recognition related datasets.
Conclusion

Figures (5)

Figure 1: The proposed ST-Gait++ architecture, composed of $3$ ST-GCN++ blocks with output sizes $32$, $64$ and $64$, followed by 2D average pooling and a $1 \times 1$ convolution that reduces dimensionality from $64$ to $4$, and finally a softmax. In an application scenario, this can ideally be paired with a skeletal trajectory extraction stage that takes a video as input and outputs the gait sequences for ST-Gait++ to analyse automatically. This work focuses on skeletal trajectory classification.
Figure 2: Examples of the four categories of the E-Gait dataset. Each item shows one sample from a category as six frames drawn from across the sample's full gait sequence. This gives the reader a sense of movement and a better understanding of E-Gait's characteristics.
Figure 3: Confusion matrices generated from evaluating the models on the test set for (a) ST-Gait++ and (b) STEP. The diagonal is more pronounced in (a), reflecting the higher accuracy of ST-Gait++. ST-Gait++ also confuses Neutral and Happy less often than STEP does, although its confusion between Happy and Sad is somewhat higher.
Figure 4: t-SNE representation of the features extracted from the last convolutional layer of ST-Gait++ and STEP on the test set. The inner color represents the ground-truth label and the outer circle represents the model's prediction.
Figure 5: Examples of correct and incorrect inferences by ST-Gait++ on the test set.
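
The classification head described in the Figure 1 caption (average pooling, a $1 \times 1$ convolution from $64$ to $4$ channels, then a softmax) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the ST-GCN++ blocks are not reproduced, the input features and weights are random placeholders, the number of frames `T` is hypothetical, and the label order in `EMOTIONS` is an assumption. It only shows how a feature map of shape (64 channels, T frames, 16 joints) could be reduced to four class probabilities.

```python
import math
import random

# Assumed label order; the source only states that there are four emotion classes.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def global_average_pool(features):
    """Average each channel's (frames x joints) map down to a single value."""
    pooled = []
    for channel in features:
        total = sum(sum(frame) for frame in channel)
        count = sum(len(frame) for frame in channel)
        pooled.append(total / count)
    return pooled


def pointwise_conv(pooled, weights, bias):
    """On a 1x1 map, a 1x1 convolution is one weighted sum of channels per class."""
    return [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, bias)
    ]


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, weights, bias):
    """Pool, project 64 -> 4, normalise, and return the most likely label."""
    probs = softmax(pointwise_conv(global_average_pool(features), weights, bias))
    best = max(range(len(probs)), key=probs.__getitem__)
    return EMOTIONS[best], probs


# Placeholder inputs: 64 channels (output size of the last block), 16 joints,
# and a hypothetical T = 12 frames. Weights are random, not trained.
rng = random.Random(0)
C, T, V = 64, 12, 16
features = [[[rng.gauss(0, 1) for _ in range(V)] for _ in range(T)] for _ in range(C)]
weights = [[rng.gauss(0, 0.1) for _ in range(C)] for _ in range(len(EMOTIONS))]
bias = [0.0] * len(EMOTIONS)

label, probs = classify(features, weights, bias)
print(label, [round(p, 3) for p in probs])
```

Because the weights are untrained, the printed label is meaningless; the point is the shape flow from 64 pooled channels to a 4-way probability vector that sums to one.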
