Isolated Sign Language Recognition with Segmentation and Pose Estimation
Daniel Perkins, Davis Hunter, Dhrumil Patel, Galen Flanagan
TL;DR
Isolated Sign Language Recognition (ISLR) is held back by scarce per-sign data and signer variability. The authors propose an efficient pipeline that combines pose estimation, segmentation, and a hybrid ResNet-Transformer backbone to capture spatial and temporal cues at reduced computational cost. On the ASL-Citizen dataset, trained under downsampled conditions imposed by resource limits, the approach reaches 68.5% top-1 and roughly 89.6% top-5 validation accuracy, and the results demonstrate the value of coordinate-based temporal modeling and normalization. The work highlights practical steps toward scalable, real-time ISLR and outlines future directions: scaling up training, reintegrating visual features, and optimizing hyperparameters.
Abstract
The recent surge in large language models has automated the translation of spoken and written languages. However, these advances remain largely inaccessible to American Sign Language (ASL) users, whose language relies on complex visual cues. Isolated sign language recognition (ISLR), the task of classifying videos of individual signs, can help bridge this gap but is currently limited by scarce per-sign data, high signer variability, and substantial computational costs. We propose a model for ISLR that reduces computational requirements while maintaining robustness to signer variation. Our approach integrates (i) a pose estimation pipeline to extract hand and face joint coordinates, (ii) a segmentation module that isolates relevant information, and (iii) a ResNet-Transformer backbone to jointly model spatial and temporal dependencies.
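To make the idea of coordinate normalization concrete, the following is a minimal sketch of one common way to reduce signer variation in extracted joint coordinates: center each frame's keypoints on their centroid and divide by the largest distance from it. The abstract does not specify which normalization scheme is used, so this scheme, the function names `normalize_frame` and `normalize_clip`, and the 2D `(x, y)` keypoint layout are all assumptions made for illustration.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]   # assumed (x, y) joint coordinate
Frame = List[Point]           # all keypoints detected in one video frame


def normalize_frame(frame: Frame, eps: float = 1e-8) -> Frame:
    """Center keypoints on their centroid and scale by the farthest point.

    Hypothetical scheme: it removes the signer's position in the image
    and their apparent size, which are irrelevant to the sign itself.
    """
    if not frame:
        return []
    n = len(frame)
    cx = sum(x for x, _ in frame) / n
    cy = sum(y for _, y in frame) / n
    centered = [(x - cx, y - cy) for x, y in frame]
    # Largest distance from the centroid; eps guards degenerate frames
    # in which every keypoint coincides.
    scale = max(max(math.hypot(dx, dy) for dx, dy in centered), eps)
    return [(dx / scale, dy / scale) for dx, dy in centered]


def normalize_clip(clip: List[Frame]) -> List[Frame]:
    """Apply per-frame normalization to every frame of a sign video."""
    return [normalize_frame(frame) for frame in clip]
```

With this sketch, two signers producing the same hand shape at different positions and distances from the camera map to the same normalized coordinates. Note that normalizing each frame independently also discards whole-body motion between frames; a per-clip reference point or scale may be preferable when that motion carries meaning.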
