Self-supervised Gait-based Emotion Representation Learning from Selective Strongly Augmented Skeleton Sequences

Cheng Song; Lu Lu; Zhen Ke; Long Gao; Shuai Ding

TL;DR

This paper tackles emotion recognition from gait under limited labeled data by introducing a self-supervised framework called SSAL. SSAL combines selective strong augmentation (SSA), including upper body jitter and random spatiotemporal masking, with a complementary feature fusion network (CFFN) that merges graph-domain ST-GCN features and image-domain AFF-based features, guided by a distributional divergence loss. The objective blends an InfoNCE-style contrastive term with a distributional divergence term, L = αL_Info + βL_d (with α = β = 1), and employs SimAM-based feature dropping to enhance robustness. Experiments on the Emotion-Gait (E-Gait) and Emilya datasets show SSAL consistently outperforms state-of-the-art self-supervised methods across linear, finetuned, and semi-supervised protocols, especially in low-label settings, highlighting its potential for nonintrusive, remote emotion sensing from gait cues.
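
The combined objective L = αL_Info + βL_d can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the temperature `tau`, the use of a KL divergence between softmax similarity distributions as the divergence term, its direction, and the choice to apply the InfoNCE term only to the generally augmented query are all assumptions made for illustration. The function names are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def similarity_logits(query, positive_key, negative_keys, tau):
    # Index 0 holds the positive pair; the rest are memory-bank negatives.
    return [dot(query, positive_key) / tau] + [dot(query, k) / tau for k in negative_keys]

def info_nce(logits):
    # InfoNCE: negative log-probability assigned to the positive (index 0).
    return -math.log(softmax(logits)[0])

def kl_divergence(p, q, eps=1e-12):
    # Assumed form of the distributional divergence term: KL(p || q).
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def ssal_style_loss(z_general, z_strong, z_key, memory_bank,
                    tau=0.07, alpha=1.0, beta=1.0):
    # tau is a hypothetical temperature; alpha = beta = 1 follows the summary above.
    logits_general = similarity_logits(z_general, z_key, memory_bank, tau)
    logits_strong = similarity_logits(z_strong, z_key, memory_bank, tau)
    l_info = info_nce(logits_general)
    l_d = kl_divergence(softmax(logits_general), softmax(logits_strong))
    return alpha * l_info + beta * l_d
```

When the strongly augmented query matches the generally augmented one, the divergence term vanishes and the loss reduces to the contrastive term alone, which is the behavior the divergence term is meant to encourage.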

Abstract

Emotion recognition is an important part of affective computing. Extracting emotional cues from human gait offers benefits such as natural interaction, nonintrusiveness, and remote detection. Recently, self-supervised learning techniques have offered a practical solution to the scarcity of labeled data in gait-based emotion recognition. However, due to the limited diversity of gaits and the incompleteness of skeleton feature representations, existing contrastive learning methods are usually inefficient at learning gait emotion representations. In this paper, we propose a contrastive learning framework utilizing selective strong augmentation (SSA) for self-supervised gait-based emotion representation, which aims to derive effective representations from limited labeled gait data. First, we propose an SSA method for the gait emotion recognition task, which includes upper body jitter and random spatiotemporal masking. The goal of SSA is to generate more diverse and targeted positive samples and prompt the model to learn more distinctive and robust feature representations. Then, we design a complementary feature fusion network (CFFN) that facilitates the integration of cross-domain information to acquire topological structural and global adaptive features. Finally, we implement a distributional divergence minimization loss to supervise the representation learning of the generally and strongly augmented queries. Our approach is validated on the Emotion-Gait (E-Gait) and Emilya datasets and outperforms state-of-the-art methods under different evaluation protocols.
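
The two SSA operations named in the abstract, upper body jitter and random spatiotemporal masking, can be sketched on a skeleton sequence stored as nested lists (frames × joints × coordinates). This is only an illustration under stated assumptions: the 16-joint layout, the joint indices assigned to each of the five body parts, the jitter scale, the number of masked frames, and the choice to zero-fill masked frames rather than delete them are all hypothetical and not taken from the paper.

```python
import random

# Hypothetical five-part partition of a 16-joint skeleton (indices are assumptions).
BODY_PARTS = {
    "torso": [0, 1, 2, 3],
    "left_arm": [4, 5, 6],
    "right_arm": [7, 8, 9],
    "left_leg": [10, 11, 12],
    "right_leg": [13, 14, 15],
}
UPPER_LIMB_JOINTS = BODY_PARTS["left_arm"] + BODY_PARTS["right_arm"]

def upper_body_jitter(seq, rng, scale=0.1):
    # Perturb only upper-limb joints; every other joint is copied unchanged.
    out = [[list(joint) for joint in frame] for frame in seq]
    for frame in out:
        for j in UPPER_LIMB_JOINTS:
            frame[j] = [c + rng.uniform(-scale, scale) for c in frame[j]]
    return out

def spatial_mask(seq, rng, max_parts=2):
    # Zero out one or two randomly chosen body parts in every frame.
    parts = rng.sample(sorted(BODY_PARTS), rng.randint(1, max_parts))
    masked = {j for p in parts for j in BODY_PARTS[p]}
    return [[[0.0] * len(joint) if i in masked else list(joint)
             for i, joint in enumerate(frame)] for frame in seq]

def temporal_mask(seq, rng, num_frames=3):
    # Zero-fill randomly chosen frames (an assumed stand-in for removing them).
    drop = set(rng.sample(range(len(seq)), min(num_frames, len(seq))))
    return [[[0.0] * len(joint) if t in drop else list(joint) for joint in frame]
            for t, frame in enumerate(seq)]

def spatiotemporal_mask(seq, rng):
    # Spatial mask followed by temporal mask.
    return temporal_mask(spatial_mask(seq, rng), rng)
```

Each function returns a new sequence and leaves its input untouched, so the same source sequence can feed both the general and the strong augmentation paths.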

Paper Structure (17 sections, 14 equations, 7 figures, 7 tables)

Introduction
Literature Review
Supervised Gait-Based Emotion Recognition
Self-Supervised Contrastive Learning
Self-Supervised Skeleton Representation
Proposed Method
Overview
Selective Strong Augmentation for Skeleton
Complementary Feature Fusion Network
Loss Function
Experiments
Datasets
Experimental Settings
Evaluation Criteria
Comparison with State-of-the-art
...and 2 more sections

Figures (7)

Figure 1: The existing approaches train deep neural networks to estimate emotion classes from gait data. Supervised methods require ground-truth emotions with gait sequences for model training. Our self-supervised method trains a model from unlabeled gait sequences.
Figure 2: The overall framework of the proposed SSAL. Given an input sequence $s$, a general augmentation $T$ and a strong augmentation $T^{\prime}$ yield two generally augmented sequences $s_1$ and $s_2$ and a strongly augmented sequence $s_3$. A momentum-updated key encoder and an MLP extract $z_1$, which is stored in the memory bank and serves as one of the negative samples for the subsequent training steps. The query encoder and an MLP are used to obtain $z_2$ and $z_3$, and the SimAM drop is applied to obtain $z^{\prime}_3$.
Figure 3: Visualization of the strong augmentation. (a) We move the joints of the upper limbs to irregular positions while keeping the other joints unchanged. (b) We segment the body into five distinct parts, with each part denoted by a unique color, and then randomly mask one or two parts with zeros. (c) We apply a spatial mask to the skeleton and randomly remove several frames from the sequences, which is equivalent to a spatiotemporal mask.
Figure 4: The architecture of the proposed CFFN. The graph-domain branch is designed with reference to the ST-GCN. The image-domain branch applies an AFF token mixer. Finally, we obtain a 128-dimensional fusion feature vector.
Figure 5: Top-1 accuracy achieved with different combinations of general augmentation strategies on the Emilya dataset.
...and 2 more figures
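
The momentum-updated key encoder and memory bank described in the Figure 2 caption can be sketched as follows. This is a simplified illustration, not the paper's code: encoder parameters are represented as a flat list of floats, the momentum coefficient `m` and the bank capacity are hypothetical values, and the first-in-first-out replacement policy is an assumption borrowed from common momentum-contrast setups.

```python
from collections import deque

def momentum_update(key_params, query_params, m=0.999):
    # Key encoder follows the query encoder as an exponential moving average.
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

class MemoryBank:
    """Fixed-size FIFO store of past key embeddings used as negatives."""

    def __init__(self, capacity):
        self.queue = deque(maxlen=capacity)

    def enqueue(self, z):
        # Once the bank is full, the oldest embedding is dropped automatically.
        self.queue.append(list(z))

    def negatives(self):
        return list(self.queue)
```

In a training loop, each step would call `momentum_update` after the query encoder is optimized and then `enqueue` the current key embedding so it can act as a negative in later steps.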
