Real-Time Sign Language Gestures to Speech Transcription using Deep Learning
Brandone Fonya, Clarence Worrell
TL;DR
The study tackles real-time ASL-to-speech transcription by deploying a CNN trained on Sign Language MNIST to classify static hand signs, then translating those gestures into text and speech via a webcam-based pipeline. It integrates OpenCV for video capture, MediaPipe Hands for hand detection, and pyttsx3 for offline text-to-speech, delivering a low-cost, portable solution. The model achieves 95.72% test accuracy on the Sign Language MNIST dataset with strong macro-precision and recall, and the real-time system demonstrates reliable gesture-to-speech translation on standard hardware, albeit with some latency from the hand-tracking stage. Overall, the work provides a practical foundation for inclusive communication technologies and points to clear avenues for handling continuous signing and multilingual sign languages in future iterations.
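The macro-averaged precision and recall mentioned above give every sign class equal weight, regardless of how many test images it has. As a minimal sketch of that computation (the function name `macro_precision_recall` and the toy labels in the usage note are illustrative assumptions, not taken from the study):

```python
from collections import Counter


def macro_precision_recall(y_true, y_pred):
    """Return (macro_precision, macro_recall) over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    true_count = Counter(y_true)   # number of samples that truly belong to each class
    pred_count = Counter(y_pred)   # number of samples predicted as each class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    # Per-class scores; a class with no predictions (or no samples) scores 0.0.
    precisions = [true_pos[c] / pred_count[c] if pred_count[c] else 0.0 for c in classes]
    recalls = [true_pos[c] / true_count[c] if true_count[c] else 0.0 for c in classes]

    # Macro averaging: unweighted mean of the per-class scores.
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n
```

For example, `macro_precision_recall([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])` averages per-class precisions of 1/2, 2/3, and 1, and per-class recalls of 1/2, 1, and 1/2.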
Abstract
Communication barriers pose significant challenges for individuals with hearing and speech impairments, often limiting their ability to interact effectively in everyday environments. This project introduces a real-time assistive technology solution that uses deep learning to translate sign language gestures into text and audible speech. A convolutional neural network (CNN) trained on the Sign Language MNIST dataset classifies hand gestures captured live via webcam. Detected gestures are translated in real time into their corresponding meanings and converted to spoken language using text-to-speech synthesis, facilitating communication between signers and non-signers. Experiments demonstrate high model accuracy and robust real-time performance with some latency, highlighting the system's practical applicability as an accessible, reliable, and user-friendly tool for enhancing the autonomy and integration of sign language users in diverse social settings.
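The step from per-frame classifications to text can be sketched as follows. This is only an illustration under stated assumptions: the label convention is that of the public Sign Language MNIST release (labels 0 to 25 for A to Z, with 9 and 25 unused because J and Z involve motion), and the `StableLetterBuffer` debouncer, including its `hold_frames` parameter, is a hypothetical helper rather than the mechanism used in this work.

```python
import string

# Assumption: public Sign Language MNIST convention. Labels 0-25 map to A-Z;
# 9 (J) and 25 (Z) never occur because those letters require motion.
MOTION_LABELS = {9, 25}


def label_to_letter(label):
    """Map a class label to its letter, or None for unused or out-of-range labels."""
    if not 0 <= label <= 25 or label in MOTION_LABELS:
        return None
    return string.ascii_uppercase[label]


class StableLetterBuffer:
    """Hypothetical debouncer: emit a letter once the same label has been
    predicted for `hold_frames` consecutive frames."""

    def __init__(self, hold_frames=5):
        self.hold_frames = hold_frames
        self.current = None   # label seen in the most recent frame
        self.run_length = 0   # consecutive frames showing that label
        self.text = []        # letters emitted so far

    def update(self, label):
        """Feed one per-frame prediction (None when no hand is detected).
        Return the newly emitted letter, or None if nothing was emitted."""
        if label != self.current:
            self.current = label
            self.run_length = 1
        else:
            self.run_length += 1
        # Emit exactly once per run, at the frame where the run reaches hold_frames.
        if label is None or self.run_length != self.hold_frames:
            return None
        letter = label_to_letter(label)
        if letter is not None:
            self.text.append(letter)
        return letter

    def transcript(self):
        """Return the text accumulated so far, ready for a text-to-speech engine."""
        return "".join(self.text)
```

In a live loop, each webcam frame's predicted label would be passed to `update`, and the string from `transcript()` could then be handed to a text-to-speech engine such as pyttsx3. Requiring several agreeing frames trades a little extra latency for fewer spurious letters.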
