Siformer: Feature-isolated Transformer for Efficient Skeleton-based Sign Language Recognition

Muxin Pu, Mei Kuan Lim, Chun Yong Chong

TL;DR

Skeleton-based sign language recognition (SLR) from video is hampered by unrealistic hand poses, missing data, and computational constraints. Siformer addresses these with three components: kinematic hand pose rectification to enforce joint constraints, a feature-isolated Transformer that encodes hand and body features separately, and an input-adaptive inference mechanism that uses internal classifiers for early exiting. The approach reaches state-of-the-art top-1 accuracy on WLASL100 (86.50%) and LSA64 (99.84%), with ablations confirming the contribution of each component. The method is robust to missing data and more efficient at inference, making it suitable for portable deployment in real-world SLR tasks. Future work could extend the approach to facial cues and explore broader, cross-linguistic datasets.

Abstract

Sign language recognition (SLR) refers to interpreting sign language glosses from given videos automatically. This research area presents a complex challenge in computer vision because of the rapid and intricate movements inherent in sign languages, which encompass hand gestures, body postures, and even facial expressions. Recently, skeleton-based action recognition has attracted increasing attention due to its ability to handle variations in subjects and backgrounds independently. However, current skeleton-based SLR methods exhibit three limitations: 1) they often neglect the importance of realistic hand poses, with most studies training SLR models on non-realistic skeletal representations; 2) they tend to assume complete data availability in both the training and inference phases, and capture intricate relationships among different body parts collectively; 3) they treat all sign glosses uniformly, failing to account for differences in the complexity of their skeletal representations. To enhance the realism of hand skeletal representations, we present a kinematic hand pose rectification method for enforcing constraints. To mitigate the impact of missing data, we propose a feature-isolated mechanism that focuses on capturing local spatial-temporal context. This mechanism captures the context concurrently and independently from individual features, thus enhancing the robustness of the SLR model. Additionally, to adapt to the varying complexity levels of sign glosses, we develop an input-adaptive inference approach to optimise computational efficiency and accuracy. Experimental results demonstrate the effectiveness of our approach, which achieves new state-of-the-art (SOTA) performance on WLASL100 and LSA64. For WLASL100, we achieve a top-1 accuracy of 86.50%, a relative improvement of 2.39% over the previous SOTA. For LSA64, we achieve a top-1 accuracy of 99.84%.

Paper Structure

This paper contains 18 sections, 10 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: The structure of the hand skeleton along with the respective joint names, including distal interphalangeal (DIP), proximal interphalangeal (PIP), metacarpophalangeal (MCP), carpometacarpal (CMC), and interphalangeal (IP) joints, based on [Hochschild2016, Rossthieme2006, isaac2022single].
  • Figure 2: Examples of adduction, abduction, flexion, and extension of the hand, referenced from [Hochschild2016, Rossthieme2006, cabibihan2021suitability].
  • Figure 3: The core components of our proposed method Siformer: (1) Kinematic rectification is applied to correct poses of sign glosses, aiming to provide realistic representations. (2) We propose a feature-isolated mechanism that captures local spatial-temporal context concurrently and independently from individual features during the encoding phase. This is followed by combinatorial-dependent decoding. (3) We integrate an internal classifier at each layer to achieve input-adaptive inference. The provided example illustrates a case when the patience value is set to 1.
  • Figure 4: Rectification analysis for varying $\alpha$ values on the WLASL100 dataset.
  • Figure 5: Robustness testing against missing data on the WLASL100 dataset.
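
The input-adaptive inference described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes a patience-based exit rule in which inference stops once the internal classifiers of consecutive layers agree `patience` times in a row. The source does not spell out the exact criterion, so this rule, the function name `early_exit_predict`, and the example labels are illustrative assumptions rather than the authors' implementation.

```python
from typing import Iterable, Optional, Tuple


def early_exit_predict(layer_labels: Iterable[int], patience: int) -> Tuple[int, int]:
    """Return (label, layers_used) under an assumed patience-based exit rule.

    layer_labels yields the label predicted by the internal classifier of
    each layer, in order of depth. Inference stops as soon as a prediction
    has matched the previous layer's prediction `patience` times in a row.
    """
    if patience < 1:
        raise ValueError("patience must be at least 1")

    previous: Optional[int] = None  # label predicted by the previous layer
    streak = 0                      # consecutive agreements seen so far
    depth = 0                       # number of layers evaluated

    for depth, label in enumerate(layer_labels, start=1):
        # Extend the streak on agreement, otherwise start over.
        streak = streak + 1 if label == previous else 0
        previous = label
        if streak >= patience:
            return label, depth     # early exit: deeper layers are skipped

    if previous is None:
        raise ValueError("at least one layer prediction is required")
    # No early exit was triggered, so fall back to the final layer's label.
    return previous, depth


# With patience = 1, agreement between layers 1 and 2 ends inference there.
print(early_exit_predict([3, 3, 5, 5], patience=1))  # (3, 2)
```

Passing a generator as `layer_labels` means a deeper layer is evaluated only when the loop requests it, which is where the computational saving for simple glosses would come from under this assumed rule.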