Self-Supervised Representation Learning with Spatial-Temporal Consistency for Sign Language Recognition

Weichao Zhao; Wengang Zhou; Hezhen Hu; Min Wang; Houqiang Li

TL;DR

This work tackles sign language recognition under limited labeled data by introducing a self-supervised pre-training framework. It exploits spatial-temporal consistency from two perspectives, hand-trunk granularity and joint-motion modalities, within a MoCo-style contrastive setup, enabling robust instance discrimination. A bidirectional knowledge transfer module enforces cross-modal alignment, while an intra-modal consistency constraint preserves holistic semantics. Experiments on four benchmarks demonstrate state-of-the-art performance and strong generalization across linear and semi-supervised protocols, with code released.

Abstract

Recent efforts have sought to improve sign language recognition by designing self-supervised learning methods. However, these methods capture limited information from sign pose data because they learn in a frame-wise manner, leading to sub-optimal solutions. To this end, we propose a simple yet effective self-supervised contrastive learning framework that mines rich context via spatial-temporal consistency from two distinct perspectives and learns instance-discriminative representations for sign language recognition. On one hand, since the semantics of sign language are expressed by the cooperation of fine-grained hands and coarse-grained trunks, we utilize information at both granularities and encode it into latent spaces. The consistency between hand and trunk features is constrained to encourage consistent representations of instance samples. On the other hand, inspired by the complementary nature of the motion and joint modalities, we first introduce first-order motion information into sign language modeling. We further bridge the embedding spaces of both modalities, facilitating bidirectional knowledge transfer to enhance sign language representation. Our method is evaluated with extensive experiments on four public benchmarks and achieves new state-of-the-art performance by a notable margin. The source code is publicly available at https://github.com/sakura/Code.
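The first-order motion mentioned in the abstract can be illustrated with a minimal sketch. It assumes that motion is the frame-to-frame difference of joint coordinates and that the last frame is zero-padded so the sequence length is preserved; the function name `first_order_motion`, the nested-list layout (frames, joints, coordinates), and the padding choice are illustrative assumptions, not details taken from the paper.

```python
from typing import List

Frame = List[List[float]]  # one frame: J joints, each a list of C coordinates


def first_order_motion(joints: List[Frame]) -> List[Frame]:
    """Return frame-to-frame joint displacements for a pose sequence.

    Assumption: motion at step t is joints[t + 1] - joints[t], and the last
    frame is padded with zeros so the output keeps the input length.
    """
    if not joints:
        return []
    motion = [
        [[n - c for c, n in zip(cur_joint, nxt_joint)]
         for cur_joint, nxt_joint in zip(cur, nxt)]
        for cur, nxt in zip(joints, joints[1:])
    ]
    # Zero displacement for the final frame (padding choice is an assumption).
    motion.append([[0.0] * len(joint) for joint in joints[-1]])
    return motion


# Two joints with 2D coordinates over three frames.
sequence = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.5, 0.0], [1.0, 2.0]],
    [[1.5, 1.0], [1.0, 2.0]],
]
motion = first_order_motion(sequence)
```

Because the difference removes each joint's absolute position, a sequence like this describes how the pose changes rather than where the signer is, which is one way to read the abstract's claim that motion complements the joint modality.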

Paper Structure (19 sections, 8 equations, 6 figures, 14 tables)

Introduction
Related Work
Sign Language Recognition
Self-supervised Representation Learning
Video Consistency Representation Learning
Approach
Preliminaries
Single Modality Branch
Bidirectional Reliable Knowledge Transfer
Overall Objective
Model Details
Experiments
Implementation Details
Datasets and Metrics
Comparison with State-of-the-art Methods
...and 4 more sections

Figures (6)

Figure 1: Illustration of our proposed pre-training method. It explicitly mines the spatial-temporal consistency in the sign pose sequence from two perspectives. One focuses on different granularity information from hand and trunk. The other involves different order information from joint and motion modalities. Their consistencies are measured in the semantic space for discriminative sign language representation.
Figure 2: The overall pipeline of the proposed framework during pre-training. The input sequence consists of three parts, i.e., both hands and the trunk. Meanwhile, we extract the first-order motion from the joints of each part. We then feed them into two branches to learn instance-discriminative representations in a contrastive learning paradigm, supervised by the contrastive losses $\mathcal{L}_{CL}^J$ and $\mathcal{L}_{CL}^M$, respectively. The key encoder is updated by momentum from the query encoder. In addition, we constrain the consistency of hand and trunk features in each branch, i.e., $\mathcal{L}_{con}^J$ and $\mathcal{L}_{con}^M$. Furthermore, we design the bidirectional knowledge transfer module to convey reliable information during cross-modal interaction, supervised by $\mathcal{L}_{KT}$. "J" and "M" are abbreviations of joint and motion.
Figure 3: t-SNE (van der Maaten and Hinton, 2008) visualization of feature embeddings. We sample 34 sign words from the SLR500 dataset and visualize the features extracted by our proposed method and by the single branches, denoted as "Ours", "Only Motion" and "Only Joint", respectively.
Figure 4: Impact of the number of neighbors $K$ in the knowledge transfer module on the MSASL dataset. The horizontal axis indicates the value of $K$, while the vertical axis indicates the top-1 accuracy.
Figure 5: Qualitative results on the effectiveness of different order information. We visualize several samples in which different signers perform the same sign words. Due to the variability among human characteristics, the static joint modality of sign pose sequences occasionally leads to wrong predictions. In contrast, the dynamic motion is consistent across signers and provides a representation complementary to the joint modality. Combining both modalities yields better predictions.
...and 1 more figures
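The Figure 2 caption describes a key encoder updated by momentum from the query encoder and branches trained with a contrastive loss. The sketch below shows the generic MoCo-style versions of those two ingredients: an exponential-moving-average parameter update and an InfoNCE loss for a single query. The function names, the flat parameter lists, the momentum `m`, and the temperature are illustrative assumptions; the exact form of $\mathcal{L}_{CL}^J$ and $\mathcal{L}_{CL}^M$ in the paper is not reproduced here, and the short list of negatives stands in for MoCo's queue.

```python
import math
from typing import List, Sequence


def momentum_update(key_params: List[float], query_params: List[float],
                    m: float = 0.999) -> List[float]:
    """Exponential moving average: key <- m * key + (1 - m) * query."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def info_nce(query: Sequence[float], positive_key: Sequence[float],
             negative_keys: Sequence[Sequence[float]],
             temperature: float = 0.07) -> float:
    """InfoNCE loss for one query; inputs are assumed to be L2-normalized."""
    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, k) / temperature for k in negative_keys]
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    # Negative log-probability of picking the positive key among all keys.
    return log_denominator - logits[0]


query = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
loss_aligned = info_nce(query, [1.0, 0.0], negatives)     # positive matches query
loss_misaligned = info_nce(query, [0.0, 1.0], negatives)  # positive is orthogonal
updated = momentum_update([1.0, 0.0], [0.0, 1.0], m=0.9)
```

In this toy setting the loss is close to zero when the positive key matches the query and noticeably larger when it does not, which is the behavior that drives instance discrimination. A momentum close to 1 keeps the key encoder changing slowly, so keys computed at different steps stay comparable.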
