Position and Rotation Invariant Sign Language Recognition from 3D Kinect Data with Recurrent Neural Networks
Prasun Roy, Saumik Bhattacharya, Partha Pratim Roy, Umapada Pal
TL;DR
This work tackles sign language recognition (SLR) from 3D Kinect data with a focus on position and rotation invariance and user independence. It constructs a 20-joint skeletal representation from Kinect v1, applies geometric alignment to achieve translation and rotation invariance, and uses an LSTM to model temporal sequences. On a dataset of 2700 Indian Sign Language gestures across 30 classes performed by 10 signers, the method achieves 84.81% accuracy under leave-one-subject-out cross-validation, outperforming SVM and HMM baselines. The results demonstrate the viability of invariant, sequence-based SLR and point to improvements via bidirectional models, attention mechanisms, and larger datasets for broader applicability.
Abstract
Sign language is a gesture-based symbolic communication medium among speech- and hearing-impaired people. It also serves as a communication bridge between non-impaired and impaired populations. Unfortunately, in most situations a non-impaired person is not well conversant in such symbolic languages, which restricts the natural information flow between these two groups. Therefore, an automated mechanism that seamlessly translates sign language into natural language can be highly advantageous. In this paper, we attempt to recognize 30 basic Indian sign gestures. Gestures are represented as temporal sequences of 3D depth maps, each consisting of the 3D coordinates of 20 body joints captured by the Kinect sensor. A recurrent neural network (RNN) is employed as the classifier. To improve the classifier's performance, we use a geometric transformation for the alignment correction of depth frames. In our experiments, the model achieves 84.81% accuracy.
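The alignment correction mentioned above can be illustrated with a minimal sketch. It is not the authors' exact procedure: it assumes each gesture is stored as a NumPy array of shape (T, 20, 3) with the y-axis pointing up, uses the hip-center joint of the first frame as the translation reference, and rotates about the vertical axis only, so that the shoulder line becomes parallel to the x-axis. The joint indices and the function name `align_sequence` are assumptions made for illustration.

```python
import numpy as np

# Assumed joint indices (hypothetical layout of the 20-joint skeleton).
HIP_CENTER, SHOULDER_LEFT, SHOULDER_RIGHT = 0, 4, 8


def align_sequence(seq):
    """Return a position- and rotation-normalized copy of a gesture.

    seq: array of shape (T, 20, 3) holding (x, y, z) joint coordinates
    per frame, with y assumed to be the vertical axis.
    """
    seq = np.asarray(seq, dtype=float)

    # Translation invariance: move the first-frame hip center to the origin.
    centered = seq - seq[0, HIP_CENTER]

    # Rotation invariance (about the vertical axis only): measure the
    # direction of the first-frame shoulder line in the x-z plane.
    d = centered[0, SHOULDER_RIGHT] - centered[0, SHOULDER_LEFT]
    theta = np.arctan2(d[2], d[0])
    c, s = np.cos(theta), np.sin(theta)

    # Rotation that maps the shoulder line onto the positive x direction.
    rot = np.array([[c, 0.0, s],
                    [0.0, 1.0, 0.0],
                    [-s, 0.0, c]])

    # Joints are row vectors, so multiply by the transpose.
    return centered @ rot.T
```

Using one reference frame for the whole sequence preserves the relative motion between frames, which is the information a recurrent classifier would model; normalizing each frame separately would discard any whole-body movement.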
