Enhancing ASL Recognition with GCNs and Successive Residual Connections

Ushnish Sarkar; Archisman Chakraborti; Tapas Samanta; Sarbajit Pal; Amitabha Das

TL;DR

The paper tackles ASL recognition by modeling hand landmarks as graphs, addressing the limitations of grid-based CNNs for non-Euclidean hand geometry. It builds graphs from 21 MediaPipe hand landmarks per frame, applies translational and scale normalization (e.g., $x'_i = x_i - x_0$ and $d_{ij} = \lVert x_i - x_j \rVert_2$ with a scale factor), and processes them with a 3-layer GCN that uses successive residual connections to stabilize training. The approach achieves a validation accuracy of $99.14\%$ on the ASL Alphabet dataset under 5-fold cross-validation, outperforming prior methods and establishing a new benchmark. This pipeline enables robust, potentially real-time ASL translation and can be extended to larger and more diverse sign-language datasets for improved assistive human-computer interaction.
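
The two normalization steps named in the summary can be illustrated with a minimal NumPy sketch. The translation step follows $x'_i = x_i - x_0$ as stated. The summary does not give the exact scale factor, so dividing by the largest pairwise distance $d_{ij}$ is an assumption made here for illustration, and the function name `normalize_landmarks` is hypothetical.

```python
import numpy as np


def normalize_landmarks(landmarks: np.ndarray) -> np.ndarray:
    """Translate and scale one hand's 21 landmarks.

    landmarks: array of shape (21, D), with landmark 0 taken as the reference
    point x_0 (the wrist in MediaPipe's indexing).
    """
    pts = np.asarray(landmarks, dtype=np.float64)
    if pts.ndim != 2 or pts.shape[0] != 21:
        raise ValueError("expected an array of shape (21, D)")

    # Translational normalization: x'_i = x_i - x_0.
    centered = pts - pts[0]

    # Pairwise Euclidean distances d_ij = ||x_i - x_j||_2.
    diffs = centered[:, None, :] - centered[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)

    # ASSUMPTION: the scale factor is the largest pairwise distance.
    # The paper's actual scale factor may be defined differently.
    scale = dists.max()
    if scale == 0.0:
        return centered  # degenerate case: all landmarks coincide
    return centered / scale
```

Under this assumed scale factor the output is unchanged when the input hand is shifted or uniformly resized, which is the consistency property the normalization is meant to provide.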

Abstract

This study presents a novel approach for enhancing American Sign Language (ASL) recognition using Graph Convolutional Networks (GCNs) integrated with successive residual connections. The method leverages the MediaPipe framework to extract key landmarks from each hand gesture, which are then used to construct graph representations. A robust preprocessing pipeline, including translational and scale normalization techniques, ensures consistency across the dataset. The constructed graphs are fed into a GCN-based neural architecture with residual connections to improve network stability. The architecture achieves state-of-the-art results, demonstrating superior generalization capabilities with a validation accuracy of 99.14%.
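
As a rough illustration of the architecture described above, the sketch below runs a forward pass through three graph-convolution layers with a skip connection around each one. Several parts are assumptions rather than details from the paper: the edge list mirrors the standard MediaPipe hand skeleton, the layer uses the common symmetric-normalized propagation rule $\hat{D}^{-1/2}(A + I)\hat{D}^{-1/2} H W$, "successive residual connections" is read as adding each layer's input to its output, and mean pooling produces the graph-level logits. All function names and weight shapes are hypothetical.

```python
import numpy as np

# ASSUMPTION: edges follow the usual MediaPipe hand skeleton (21 connections).
HAND_EDGES = [
    (0, 1), (1, 2), (2, 3), (3, 4),          # thumb
    (0, 5), (5, 6), (6, 7), (7, 8),          # index finger
    (5, 9), (9, 10), (10, 11), (11, 12),     # middle finger
    (9, 13), (13, 14), (14, 15), (15, 16),   # ring finger
    (13, 17), (17, 18), (18, 19), (19, 20),  # little finger
    (0, 17),                                 # palm edge back to the wrist
]


def normalized_adjacency(num_nodes: int, edges) -> np.ndarray:
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected edge list."""
    adj = np.eye(num_nodes)  # self-loops
    for i, j in edges:
        adj[i, j] = 1.0
        adj[j, i] = 1.0
    inv_sqrt_deg = 1.0 / np.sqrt(adj.sum(axis=1))
    return adj * inv_sqrt_deg[:, None] * inv_sqrt_deg[None, :]


def gcn_layer(a_hat: np.ndarray, h: np.ndarray, w: np.ndarray) -> np.ndarray:
    """One graph convolution followed by ReLU."""
    return np.maximum(a_hat @ h @ w, 0.0)


def forward(x, a_hat, w_in, gcn_weights, w_out) -> np.ndarray:
    """Project node features, apply residual GCN layers, pool, classify."""
    h = x @ w_in                        # (21, hidden)
    for w in gcn_weights:               # three layers in the described model
        h = h + gcn_layer(a_hat, h, w)  # residual (skip) connection
    pooled = h.mean(axis=0)             # ASSUMPTION: mean pooling over nodes
    return pooled @ w_out               # (num_classes,) unnormalized logits


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden, num_classes = 16, 28        # placeholder sizes, not from the paper
    a_hat = normalized_adjacency(21, HAND_EDGES)
    x = rng.random((21, 3))             # stand-in for normalized landmarks
    w_in = rng.normal(size=(3, hidden))
    gcn_weights = [rng.normal(size=(hidden, hidden)) * 0.1 for _ in range(3)]
    w_out = rng.normal(size=(hidden, num_classes))
    print(forward(x, a_hat, w_in, gcn_weights, w_out).shape)
```

The skip connection lets each layer learn a correction to its input rather than a full replacement, which is the usual argument for why residual wiring improves stability when layers are stacked.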

Paper Structure (13 sections, 9 equations, 6 figures, 4 tables, 1 algorithm)

Proposed Methods
Preprocessing
MediaPipe Based Graph Construction
Normalizations
Data creation
Network Architecture
GCN Layer
Experiments
Training Details
Results and Discussion
Validation Metrics
Training Time
Conclusion

Figures (6)

Figure 1: The 21 hand landmarks detected by the MediaPipe Hands solution.
Figure 2: Example ASL frame for the letter B, with the MediaPipe hand landmarks detected on it.
Figure 3: Number of data instances per class. Note that only 1 instance could be extracted for class 27, i.e., the class "DELETE".
Figure 4: Network architecture showing the GCN layers and successive residual connections. Total number of parameters in the model: 142,447.
Figure 5: Training and validation losses measured during training. The validation loss curve is always below the training loss curve because it was computed on far fewer data instances. The slight instability in the validation loss curve is due to the small batch size used during training. These curves correspond to the fold with the best validation loss in 5-fold cross-validation.
