Isolated Sign Language Recognition with Segmentation and Pose Estimation

Daniel Perkins, Davis Hunter, Dhrumil Patel, Galen Flanagan

TL;DR

Isolated Sign Language Recognition (ISLR) is held back by scarce per-sign data and signer variability. The authors propose an efficient pipeline that fuses pose estimation, segmentation, and a hybrid ResNet–Transformer backbone to capture spatial and temporal cues with reduced computation. On the ASL-Citizen dataset, with downsampling imposed by resource limits, the approach reaches 68.5% top-1 and roughly 89.6% top-5 accuracy on the validation set and demonstrates the value of coordinate-based temporal modeling and normalization. The work highlights practical steps toward scalable, real-time ISLR and outlines future directions: scaling up training, reintegrating visual features, and optimizing hyperparameters.
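For readers unfamiliar with the top-1 and top-5 figures quoted above, the following is a minimal sketch of how top-k accuracy is usually computed. The function name `top_k_accuracy` and the toy scores are illustrative assumptions; this is not the authors' evaluation code.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int = 1,
) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")

    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the best k.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy example with made-up scores for 4 clips and 3 sign classes.
scores = [
    [0.10, 0.70, 0.20],
    [0.50, 0.30, 0.20],
    [0.20, 0.30, 0.50],
    [0.40, 0.35, 0.25],
]
labels = [1, 1, 2, 2]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

Top-5 accuracy is the same computation with `k=5`. It is always at least as high as top-1, which is why the two numbers are commonly reported together for tasks with a large vocabulary of classes.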

Abstract

The recent surge in large language models has automated translations of spoken and written languages. However, these advances remain largely inaccessible to American Sign Language (ASL) users, whose language relies on complex visual cues. Isolated sign language recognition (ISLR), the task of classifying videos of individual signs, can help bridge this gap but is currently limited by scarce per-sign data, high signer variability, and substantial computational costs. We propose a model for ISLR that reduces computational requirements while maintaining robustness to signer variation. Our approach integrates (i) a pose estimation pipeline to extract hand and face joint coordinates, (ii) a segmentation module that isolates relevant information, and (iii) a ResNet-Transformer backbone to jointly model spatial and temporal dependencies.

Paper Structure

This paper contains 28 sections, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Examples of frames after pose processing. The joint coordinates of the hand and face are output and used to crop out the hands and face.
  • Figure 2: Illustration of the original proposed model: (A) RGB video input; (B) Normalized video; (C) Segmented video and joint coordinates from MediaPipe; (D) In each frame, the segmented video is passed into a ResNet and the coordinates are passed into a transformer; (E) The embeddings from each frame are concatenated and passed into a transformer; (F) After a linear layer, the final prediction is made.
  • Figure 3: Top-1 accuracy of the final model on the validation set.
  • Figure 4: Accuracy of the original model through training. The model completed only 24 epochs in two days of training, and validation accuracy never moved past the baseline of 0.05%.
  • Figure 5: Accuracy of the model that uses both the segmented videos and the coordinates as input. The model passes the video frames through a ResNet-18, concatenates these features with the coordinates, and passes them through a transformer along the temporal dimension. Neither training nor validation accuracy improved during training.
  • ...and 3 more figures
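
The Figure 1 caption says the joint coordinates are used to crop out the hands and face. A minimal sketch of that idea follows, assuming landmarks arrive as (x, y) pairs normalized to [0, 1] (the convention MediaPipe uses) and that a frame is a nested list of pixel rows. The function names and the `pad` parameter are hypothetical; the paper's actual cropping procedure is not described in this summary.

```python
from typing import List, Sequence, Tuple

BBox = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def landmark_bbox(
    landmarks: Sequence[Tuple[float, float]],
    width: int,
    height: int,
    pad: float = 0.1,
) -> BBox:
    """Pixel bounding box around normalized landmarks, clamped to the frame."""
    if not landmarks:
        raise ValueError("at least one landmark is required")
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)

    # Assumed margin: a fraction of the box size added on every side.
    px = (x1 - x0) * pad
    py = (y1 - y0) * pad

    left = max(0, int((x0 - px) * width))
    top = max(0, int((y0 - py) * height))
    right = min(width, int((x1 + px) * width) + 1)
    bottom = min(height, int((y1 + py) * height) + 1)
    return left, top, right, bottom


def crop(frame: List[list], box: BBox) -> List[list]:
    """Return the sub-image of a row-major frame covered by the box."""
    left, top, right, bottom = box
    return [row[left:right] for row in frame[top:bottom]]


# Toy 4x4 "frame" whose pixel values encode their own position.
frame = [[r * 4 + c for c in range(4)] for r in range(4)]
hand = [(0.25, 0.25), (0.5, 0.5)]  # made-up landmarks
box = landmark_bbox(hand, width=4, height=4, pad=0.0)
patch = crop(frame, box)
```

Applied per frame to the hand and face landmark sets, a routine like this would yield the kind of crops shown in Figure 1, leaving the raw coordinates available as a separate input stream.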