An Open-Source American Sign Language Fingerspell Recognition and Semantic Pose Retrieval Interface
Kevin Jose Thomas
TL;DR
This work addresses the need for accessible ASL translation tools by delivering an open-source interface that handles fingerspell recognition and semantic pose retrieval as separate modules. The recognition module uses Google MediaPipe hand landmarks with two classifiers (a lightweight 2D CNN and a 3D PointNet) to convert fingerspelling into spoken English, followed by BERT-based syntactic correction. The production module maps spoken English to ASL gloss via an LLM, retrieves sign poses semantically from a pgvector-backed pose database, and stitches the retrieved pose sequences together for fluent signing. The system operates in real time and is robust to varying backgrounds, lighting, skin tones, and hand sizes, offering a practical stepping stone toward full ASL translation and enabling developers to build accessible, sign-language-enabled applications.
Abstract
This paper introduces an open-source interface for American Sign Language fingerspell recognition and semantic pose retrieval, intended to serve as a stepping stone towards more advanced sign language translation systems. Using a combination of convolutional neural networks and pose estimation models, the interface provides two modular components: a recognition module that translates ASL fingerspelling into spoken English and a production module that converts spoken English into ASL pose sequences. The system is designed to be highly accessible, user-friendly, and capable of functioning in real time across varying backgrounds, lighting conditions, skin tones, and hand sizes. We discuss the technical details of the model architecture, its application in the wild, and potential future enhancements for real-world consumer applications.
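To make the idea of semantic pose retrieval concrete, the following is a minimal in-memory sketch, not the system's actual implementation. It ranks stored entries by cosine distance, which is the metric pgvector exposes through its `<=>` operator, and concatenates the retrieved pose sequences. The names `POSE_DB`, `retrieve_pose`, and `stitch`, the three-dimensional embeddings, and the frame labels are all hypothetical placeholders chosen for illustration; the real database schema and embedding model are not described in this section.

```python
import math

def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical toy database: gloss -> (embedding, pose frames).
# Real embeddings would be high-dimensional and stored in a database.
POSE_DB = {
    "HELLO": ([1.0, 0.0, 0.0], ["hello_f0", "hello_f1"]),
    "THANK-YOU": ([0.0, 1.0, 0.0], ["thanks_f0", "thanks_f1"]),
    "HOUSE": ([0.0, 0.0, 1.0], ["house_f0"]),
}

def retrieve_pose(query_embedding):
    """Return the (gloss, pose frames) entry closest to the query embedding."""
    gloss = min(
        POSE_DB,
        key=lambda g: cosine_distance(query_embedding, POSE_DB[g][0]),
    )
    return gloss, POSE_DB[gloss][1]

def stitch(query_embeddings):
    """Retrieve a pose sequence per query and concatenate the frames in order."""
    frames = []
    for query in query_embeddings:
        _, sequence = retrieve_pose(query)
        frames.extend(sequence)
    return frames
```

Because retrieval is by nearest embedding rather than exact string match, a query whose embedding falls near a stored sign (for example, a synonym) would still resolve to that sign's pose sequence, under the assumption that the embedding model places related words close together.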
