Sign Language Recognition Based On Facial Expression and Hand Skeleton

Zhiyu Long; Xingyou Liu; Jiaqi Qiao; Zhi Li

TL;DR

This work tackles sign language recognition from monocular video by fusing hand-skeleton coordinates, transformed into a canonical hand coordinate system, with facial-expression confidences obtained from a Self-Cure Network. Each frame's feature vector combines 21 hand joints in 3D (after the transformation) with 7 facial-expression scores, giving a length of $21\times3+7 = 70$; the sequence of vectors is processed by a CNN for spatial features and an LSTM for temporal dynamics. The proposed data-compensation network outperforms a baseline CNN-LSTM, achieving 92% accuracy on SEUCSLRD and 91.25% on LSA64 (facial features alone achieving 80% and 91.25% respectively in ablations), which demonstrates the value of integrating both manual and non-manual cues for robust sign language recognition. The results suggest that this approach can improve practical sign language systems across varied environments and signers by combining coordinate-transformed hand-posture representations with facial expressions.
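The per-frame feature layout described above can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names `frame_feature` and `clip_features` are hypothetical, and the ordering (flattened joints first, expression scores last) is an assumption, since the summary only gives the total length $21\times3+7$.

```python
from typing import List, Sequence

NUM_JOINTS = 21        # hand joints per frame (from the text)
NUM_EXPRESSIONS = 7    # facial-expression scores per frame (from the text)
FEATURE_DIM = NUM_JOINTS * 3 + NUM_EXPRESSIONS  # 21 * 3 + 7 = 70


def frame_feature(hand_joints: Sequence[Sequence[float]],
                  expression_scores: Sequence[float]) -> List[float]:
    """Build one frame's feature vector (hypothetical helper).

    Assumed layout: 63 flattened joint coordinates, then 7 expression scores.
    """
    if len(hand_joints) != NUM_JOINTS or any(len(j) != 3 for j in hand_joints):
        raise ValueError("expected 21 joints with 3 coordinates each")
    if len(expression_scores) != NUM_EXPRESSIONS:
        raise ValueError("expected 7 facial-expression scores")
    flat_joints = [float(c) for joint in hand_joints for c in joint]
    return flat_joints + [float(s) for s in expression_scores]


def clip_features(hand_seq: Sequence[Sequence[Sequence[float]]],
                  expr_seq: Sequence[Sequence[float]]) -> List[List[float]]:
    """Stack per-frame vectors into a T x 70 sequence for a temporal model."""
    return [frame_feature(h, e) for h, e in zip(hand_seq, expr_seq)]
```

A clip of T frames then becomes a T x 70 sequence, which is the kind of input a CNN-LSTM pipeline like the one described would consume.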

Abstract

Sign language is a visual language used by the deaf and hard-of-hearing community to communicate. However, most recognition methods based on monocular cameras suffer from low accuracy and poor robustness. Even when a method performs well on some data, it may perform poorly on other data with different interference, because it fails to extract effective features. To address these problems, we propose a sign language recognition network that integrates hand skeleton features and facial expressions. Specifically, we propose a hand skeleton feature extraction method based on coordinate transformation that describes the shape of the hand more accurately. Moreover, incorporating facial expression information improves both the accuracy and the robustness of sign language recognition, as verified on A Dataset for Argentinian Sign Language (LSA64) and SEU's Chinese Sign Language Recognition Database (SEUCSLRD).
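To illustrate the idea of re-expressing hand joints in a hand-attached coordinate system, here is a minimal sketch. The abstract does not specify how the paper builds its hand frame, so the construction below is an assumption: the wrist is the origin, the x-axis points to the middle-finger MCP joint, and the index-finger MCP joint fixes the palm plane. The landmark indices 0, 5, and 9 follow MediaPipe's hand landmark numbering; the function name `to_hand_frame` is hypothetical.

```python
import math
from typing import List, Sequence

# MediaPipe hand landmark indices used to define the frame (assumed choice).
WRIST, INDEX_MCP, MIDDLE_MCP = 0, 5, 9


def _sub(a, b):
    return [a[i] - b[i] for i in range(3)]


def _dot(a, b):
    return sum(a[i] * b[i] for i in range(3))


def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    if norm == 0.0:
        raise ValueError("degenerate hand pose: cannot build a hand frame")
    return [c / norm for c in v]


def to_hand_frame(joints: Sequence[Sequence[float]]) -> List[List[float]]:
    """Express 21 world-space joints in a wrist-centred orthonormal frame."""
    origin = joints[WRIST]
    # x-axis: wrist -> middle-finger MCP.
    x_axis = _normalize(_sub(joints[MIDDLE_MCP], origin))
    # z-axis: normal of the plane spanned by x and wrist -> index-finger MCP.
    z_axis = _normalize(_cross(x_axis, _sub(joints[INDEX_MCP], origin)))
    # y-axis completes a right-handed frame (x cross y = z).
    y_axis = _cross(z_axis, x_axis)
    local = []
    for p in joints:
        rel = _sub(p, origin)
        local.append([_dot(rel, x_axis), _dot(rel, y_axis), _dot(rel, z_axis)])
    return local
```

Because the axes are derived from the hand itself, the resulting coordinates do not change when the whole hand is translated or rotated in the camera frame, which is the property that makes such a representation describe hand shape rather than hand placement.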

Paper Structure (11 sections, 3 equations, 4 figures, 1 table)

Introduction
Related Work
Method
Data Preprocessing
Skeleton Data Extraction Based on Coordinate Transformation
Facial Expression Data Extraction
Data Compensation Network
Experiments
Sign language dataset
Experimental analysis and discussion
Conclusion

Figures (4)

Figure 1: MediaPipe hand joint points and their corresponding serial numbers.
Figure 2: Transformation from world coordinate system (left) to hand coordinate system (right).
Figure 3: New sign language recognition data compensation network.
Figure 4: Comparison of accuracy of different methods on SEUCSLRD and LSA64 datasets.
