STEP: Spatial Temporal Graph Convolutional Networks for Emotion Perception from Gaits

Uttaran Bhattacharya, Trisha Mittal, Rohan Chandra, Tanmay Randhavane, Aniket Bera, Dinesh Manocha

TL;DR

STEP is a novel classifier network, based on a Spatial Temporal Graph Convolutional Network (ST-GCN) architecture, that classifies perceived human emotion from gaits. It learns the affective features and reaches a classification accuracy of 89% on E-Gait, which is 14–30% more accurate than prior methods.

Abstract

We present a novel classifier network called STEP, to classify perceived human emotion from gaits, based on a Spatial Temporal Graph Convolutional Network (ST-GCN) architecture. Given an RGB video of an individual walking, our formulation implicitly exploits the gait features to classify the emotional state of the human into one of four emotions: happy, sad, angry, or neutral. We use hundreds of annotated real-world gait videos and augment them with thousands of annotated synthetic gaits generated using a novel generative network called STEP-Gen, built on an ST-GCN-based Conditional Variational Autoencoder (CVAE). We incorporate a novel push-pull regularization loss in the CVAE formulation of STEP-Gen to generate realistic gaits and improve the classification accuracy of STEP. We also release a novel dataset (E-Gait), which consists of $2,177$ human gaits annotated with perceived emotions along with thousands of synthetic gaits. In practice, STEP can learn the affective features and exhibits a classification accuracy of 89% on E-Gait, which is 14–30% more accurate than prior methods.

Paper Structure

This paper contains 17 sections, 5 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: STEP and STEP-Gen: We present a novel classifier network (STEP) to predict perceived emotions from gaits, as shown for this walking video. Furthermore, we present a generator network (STEP-Gen) to generate annotated synthetic gaits from our real world gait dataset to improve the accuracy of STEP. We evaluate their performance on a novel E-Gait dataset and observe $14-30\%$ improvement in the classification accuracy over prior methods.
  • Figure 2: Our Generation Network (STEP-Gen): The encoder consists of ST-GCN, Average Pool and Conv2D layers. The decoder consists of DeConv2D, Repeat and ST-GDCN layers. RSG (Random Sample Generator) is used to generate random samples from the latent space. $+$ denotes appending; $T$: number of time steps ($75$ in our dataset); $V$: number of nodes ($16$ in our dataset); $C$: dimension of each node ($3$ in our dataset). Input: Human gaits processed from walking videos and corresponding emotion label. Spheres are nodes, thick red lines are spatial edges and thin gray lines are temporal edges. Output: Human gaits corresponding to the input label, with same $T$, $V$, and $C$.
  • Figure 3: Our Classifier Network (STEP): It consists of ST-GCN, Average Pool, Conv2D and fully connected (FC) layers. $+$ denotes appending. $T$: number of time steps ($75$ in our dataset); $V$: number of nodes ($16$ in our dataset); $C$: dimension of each node ($3$ in our dataset). Input: Human gaits processed from walking videos. Spheres are nodes, thick red lines are spatial edges and thin gray lines are temporal edges. Output: Predicted label after Softmax. The first Softmax from the left gives the output of Baseline-STEP, and the second Softmax gives the output of STEP.
  • Figure 4: Effect of Data Augmentation: Effect on the performance of STEP+Aug of augmenting its train and test sets with synthetically generated data. For every percent improvement in accuracy, an exponentially larger amount of synthetic data needs to be added.
  • Figure 5: Training Loss Convergence: Our "Push-Pull" regularization loss (the STEP-Gen loss equation) as a function of training epochs, as produced by the baseline-CVAE and our STEP-Gen. The baseline-CVAE fails to converge even after $150$ epochs, while STEP-Gen converges after approximately $28$ epochs.
  • ...and 2 more figures
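
The captions for Figures 2 and 3 describe a gait as $T$ frames of $V$ nodes with $C$ values per node, linked by spatial edges (within a frame) and temporal edges (across frames). The sketch below illustrates that data layout with one unweighted spatial graph convolution followed by a temporal averaging step. It is a simplified illustration and not the authors' network: the chain-shaped skeleton in `SPATIAL_EDGES`, the symmetric adjacency normalization, the window size, and all function names are assumptions made here, and no learned weights are involved.

```python
import math

# Dimensions taken from the figure captions: time steps, nodes, values per node.
T, V, C = 75, 16, 3

# Hypothetical skeleton: a simple chain of joints. The real 16-joint layout
# is not given in the captions, so this is a stand-in.
SPATIAL_EDGES = [(i, i + 1) for i in range(V - 1)]


def normalized_adjacency(num_nodes, edges):
    """Return D^{-1/2} (A + I) D^{-1/2} as a nested list (assumed normalization)."""
    a = [[1.0 if i == j else 0.0 for j in range(num_nodes)]
         for i in range(num_nodes)]
    for i, j in edges:
        a[i][j] = 1.0
        a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
            for i in range(num_nodes)]


def spatial_graph_conv(gait, adj):
    """Mix each node's values with its spatial neighbours', frame by frame."""
    num_nodes = len(adj)
    out = []
    for frame in gait:  # frame has shape V x C
        out.append([
            [sum(adj[i][j] * frame[j][c] for j in range(num_nodes))
             for c in range(len(frame[i]))]
            for i in range(num_nodes)
        ])
    return out


def temporal_conv(gait, kernel_size=3):
    """Average each node over a sliding window of frames (no padding)."""
    out = []
    for t in range(len(gait) - kernel_size + 1):
        window = gait[t:t + kernel_size]
        out.append([
            [sum(f[v][c] for f in window) / kernel_size
             for c in range(len(window[0][v]))]
            for v in range(len(window[0]))
        ])
    return out


adj = normalized_adjacency(V, SPATIAL_EDGES)
# Toy gait: every value in frame t equals t, so the result is easy to check.
gait = [[[float(t)] * C for _ in range(V)] for t in range(T)]
features = temporal_conv(spatial_graph_conv(gait, adj))
print(len(features), len(features[0]), len(features[0][0]))
```

A full ST-GCN layer would additionally apply learned weights to the $C$ values at each node and stack several such layers; the sketch only shows how spatial edges and temporal edges act on a $T \times V \times C$ gait.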

Theorems & Definitions (1)

  • Definition 4.1