PenSLR: Persian end-to-end Sign Language Recognition Using Ensembling

Amirparsa Salmankhah; Amirreza Rajabi; Negin Kheirmand; Ali Fadaeimanesh; Amirreza Tarabkhah; Amirreza Kazemzadeh; Hamed Farbeh

TL;DR

PenSLR is a glove-based sign language recognition system that combines an Inertial Measurement Unit and five flexible sensors with a deep learning framework capable of predicting variable-length sequences. The paper also proposes a novel ensembling technique built on a multiple sequence alignment algorithm known as Star Alignment.

Abstract

Sign Language Recognition (SLR) is a fast-growing field that aims to fill the communication gaps between the hearing-impaired and people without hearing loss. Existing solutions for Persian Sign Language (PSL) are limited to word-level interpretations, underscoring the need for more advanced and comprehensive solutions. Moreover, previous work on other languages mainly focuses on manipulating the neural network architectures or hardware configurations instead of benefiting from the aggregated results of multiple models. In this paper, we introduce PenSLR, a glove-based sign language system consisting of an Inertial Measurement Unit (IMU) and five flexible sensors powered by a deep learning framework capable of predicting variable-length sequences. We achieve this in an end-to-end manner by leveraging the Connectionist Temporal Classification (CTC) loss function, eliminating the need for segmentation of input signals. To further enhance its capabilities, we propose a novel ensembling technique by leveraging a multiple sequence alignment algorithm known as Star Alignment. Furthermore, we introduce a new PSL dataset, including 16 PSL signs with more than 3000 time-series samples in total. We utilize this dataset to evaluate the performance of our system based on four word-level and sentence-level metrics. Our evaluations show that PenSLR achieves a remarkable word accuracy of 94.58% and 96.70% in subject-independent and subject-dependent setups, respectively. These achievements are attributable to our ensembling algorithm, which not only boosts the word-level performance by 0.51% and 1.32% in the respective scenarios but also yields significant enhancements of 1.46% and 4.00%, respectively, in sentence-level accuracy.
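
To illustrate why a CTC-trained model needs no segmentation of the input signal, here is a minimal sketch of greedy (best-path) CTC decoding in plain Python. It is not the authors' code: the decoding strategy, the blank index `BLANK = 0`, the function name `ctc_greedy_decode`, and the toy label set and scores are all assumptions made for illustration. The model emits one score vector per time step; decoding merges consecutive repeats and drops blanks, so any number of frames can map to a gloss sequence of any shorter length.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_scores, labels):
    """Collapse a per-frame best path into a variable-length gloss sequence."""
    # 1. Pick the highest-scoring class at every time step.
    best_path = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    # 2. Merge consecutive repeats, then drop the blank symbol.
    collapsed = [k for k, _ in groupby(best_path)]
    return [labels[k] for k in collapsed if k != BLANK]


labels = ["<blank>", "Blue", "Year"]  # hypothetical label set
frames = [  # hypothetical per-frame class scores (6 time steps)
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.2, 0.7],
    [0.1, 0.1, 0.8],
]
print(ctc_greedy_decode(frames, labels))  # ['Blue', 'Year']
```

A blank between two identical labels keeps them separate, which is how the same sign can be recognized twice in a row.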

Paper Structure (24 sections, 8 equations, 5 figures, 7 tables, 3 algorithms)

Introduction
Related Work
Sign Language Recognition
Vision-based Methods
Wireless sensing-based Methods
Wearable-based Methods
Multiple Sequence Alignment
Dataset
Proposed Method
Glove Design
Seq2Seq Model
Preprocessing
Model Architecture
CTC Loss
Sequence Alignment
...and 9 more sections

Figures (5)

Figure 1: Illustration of step-by-step execution of two pairs of PSL glosses ("Blue", "Year") and ("is", "Very") belonging to two distinct similarity groups. The blue glosses have similar finger positions but different hand movements, while the red ones are examples of rotational gestures.
Figure 2: Our sign language glove, equipped with an Adafruit BNO055 IMU mounted on the back of the hand and five flexible sensors, one on each finger.
Figure 3: Architecture of our Seq2Seq model
Figure 4: An illustration of how the Star Alignment algorithm computes the similarity matrix between 5 sequences ($S_1="ABCFB"$, $S_2="ABCBBC"$, $S_3="ACFBC"$, $S_4="BCFCC"$, $S_5="ABEFBBC"$) and uses it to progressively align them. Sequence $S_2$ is designated as the center since the sum of similarities in the second column is the highest. The scoring parameters are $S_{match} = 3$, $S_{gap} = -2$, and $S_{mis} = -1$.
Figure 5: An example of the execution of our ensembling algorithm. The outputs produced by all models are aligned using Star Alignment, after which a voting process selects the most probable gesture (denoted by characters A to F) at each position. Finally, the gaps are dropped, and the final result is generated. In this example, the best model (Model 1) cannot fully predict the ground truth sequence, but the ensembling method enables the system to generate it successfully.
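
The alignment and voting steps described in the captions of Figures 4 and 5 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes single-character gesture labels, Needleman-Wunsch global alignment for the pairwise similarities, a center chosen by the highest sum of similarities to the other sequences (self-similarity excluded), vote ties broken in favor of the earliest model, and columns dropped when the gap symbol wins the vote. The names `needleman_wunsch`, `star_align`, and `ensemble_vote` are hypothetical; the default scores follow the Figure 4 caption.

```python
GAP = "-"  # assumed gap symbol; must not be used as a gesture label


def needleman_wunsch(a, b, match=3, mismatch=-1, gap=-2):
    """Global pairwise alignment. Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(GAP)
            i -= 1
        else:
            out_a.append(GAP)
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def star_align(seqs, **scoring):
    """Return (center_index, aligned_rows) for a list of sequences."""
    k = len(seqs)
    sim = [[0] * k for _ in range(k)]  # diagonal stays 0 (self excluded)
    for i in range(k):
        for j in range(i + 1, k):
            sim[i][j] = sim[j][i] = needleman_wunsch(seqs[i], seqs[j], **scoring)[0]
    center = max(range(k), key=lambda i: sum(sim[i]))
    rows = {center: list(seqs[center])}
    for idx in range(k):
        if idx == center:
            continue
        _, c_aln, s_aln = needleman_wunsch(seqs[center], seqs[idx], **scoring)
        master = rows[center]
        merged = {i: [] for i in rows}
        merged[idx] = []
        p = q = 0
        # Merge the new pairwise alignment: "once a gap, always a gap".
        while p < len(master) or q < len(c_aln):
            m_gap = p < len(master) and master[p] == GAP
            c_gap = q < len(c_aln) and c_aln[q] == GAP
            if m_gap and not c_gap:      # old gap column: new row gets a gap
                for i in rows:
                    merged[i].append(rows[i][p])
                merged[idx].append(GAP)
                p += 1
            elif c_gap and not m_gap:    # new gap column: old rows get a gap
                for i in rows:
                    merged[i].append(GAP)
                merged[idx].append(s_aln[q])
                q += 1
            else:                        # columns line up
                for i in rows:
                    merged[i].append(rows[i][p])
                merged[idx].append(s_aln[q])
                p, q = p + 1, q + 1
        rows = merged
    return center, ["".join(rows[i]) for i in range(k)]


def ensemble_vote(predictions, **scoring):
    """Align all model outputs, vote per column, and drop gap winners."""
    _, aligned = star_align(predictions, **scoring)
    result = []
    for column in zip(*aligned):
        # max() keeps the first maximal symbol, so ties favor earlier models.
        winner = max(dict.fromkeys(column), key=column.count)
        if winner != GAP:
            result.append(winner)
    return "".join(result)


if __name__ == "__main__":
    # Toy case: the first model misses one gesture, the other two recover it.
    print(ensemble_vote(["ABD", "ABCD", "ABCD"]))  # ABCD
```

Because the center-selection and tie-breaking rules above are assumptions, the center this sketch picks for the five sequences in Figure 4 may differ from the one shown in the figure.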
