Real-Time Sign Language Gestures to Speech Transcription using Deep Learning

Brandone Fonya, Clarence Worrell

TL;DR

The study tackles real-time ASL-to-speech transcription by deploying a CNN trained on Sign Language MNIST to classify static hand signs, then translating those gestures into text and speech via a webcam-based pipeline. It integrates OpenCV for video capture, MediaPipe Hands for hand detection, and pyttsx3 for offline text-to-speech, delivering a low-cost, portable solution. The model achieves 95.72% test accuracy on the Sign Language MNIST dataset with strong macro-precision and recall, and the real-time system demonstrates reliable gesture-to-speech translation on standard hardware, albeit with some latency from the hand-tracking stage. Overall, the work provides a practical foundation for inclusive communication technologies and points to clear avenues for handling continuous signing and multilingual sign languages in future iterations.

Abstract

Communication barriers pose significant challenges for individuals with hearing and speech impairments, often limiting their ability to interact effectively in everyday environments. This project introduces a real-time assistive technology solution that uses deep learning to translate sign language gestures into text and audible speech. A convolutional neural network (CNN) trained on the Sign Language MNIST dataset classifies hand gestures captured live via webcam. Detected gestures are translated into their corresponding meanings and converted into spoken language using text-to-speech synthesis, facilitating communication. Experiments demonstrate high model accuracy and robust real-time performance with some latency, highlighting the system's practical applicability as an accessible, reliable, and user-friendly tool for enhancing the autonomy and integration of sign language users in diverse social settings.
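
As an illustration of the step between per-frame classification and speech output, the sketch below maps Sign Language MNIST class labels to letters and emits a letter only after several consecutive confident frames agree. It is a minimal sketch, not the authors' implementation: the names `GestureStabilizer` and `transcribe`, the window size, and the confidence threshold are hypothetical, and it assumes the classifier reports the dataset's label indices (A=0 through Z=25, with J and Z absent because those signs involve motion).

```python
import string
from collections import deque
from typing import Deque, Dict, Iterable, Optional, Tuple

# Sign Language MNIST labels follow alphabet positions (A=0 ... Z=25).
# J (9) and Z (25) require motion, so the dataset has no samples for them.
LABEL_TO_LETTER: Dict[int, str] = {
    i: ch for i, ch in enumerate(string.ascii_uppercase) if ch not in ("J", "Z")
}


class GestureStabilizer:
    """Emit a letter once it has won `window` consecutive confident frames.

    Hypothetical helper: `window` and `min_confidence` are illustrative
    assumptions, not values reported in the paper.
    """

    def __init__(self, window: int = 5, min_confidence: float = 0.8) -> None:
        self.window = window
        self.min_confidence = min_confidence
        self._recent: Deque[int] = deque(maxlen=window)
        self._last_emitted: Optional[int] = None

    def update(self, label: int, confidence: float) -> Optional[str]:
        """Feed one frame's prediction; return a letter or None."""
        if confidence < self.min_confidence or label not in LABEL_TO_LETTER:
            # An unreliable frame resets the history, which also allows the
            # same letter to be emitted again afterwards (e.g. a double "L").
            self._recent.clear()
            self._last_emitted = None
            return None
        self._recent.append(label)
        stable = len(self._recent) == self.window and len(set(self._recent)) == 1
        if stable and label != self._last_emitted:
            self._last_emitted = label
            return LABEL_TO_LETTER[label]
        return None


def transcribe(frames: Iterable[Tuple[int, float]], window: int = 3) -> str:
    """Turn a sequence of (label, confidence) predictions into text."""
    stabilizer = GestureStabilizer(window=window)
    letters = []
    for label, confidence in frames:
        letter = stabilizer.update(label, confidence)
        if letter is not None:
            letters.append(letter)
    return "".join(letters)
```

In a live pipeline, each `(label, confidence)` pair would come from the CNN's prediction on the hand region of a webcam frame, and the resulting string could be passed to a text-to-speech engine such as pyttsx3 (`engine.say(text)` followed by `engine.runAndWait()`). Requiring agreement across frames trades a little extra latency for fewer spurious letters.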

Paper Structure

This paper contains 20 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Hand signs in the American Sign Language MNIST dataset
  • Figure 2: Model training and validation accuracy over epochs. The model demonstrates rapid convergence with validation accuracy stabilizing above 99% after only a few epochs.
  • Figure 3: Model training and validation loss over epochs.
  • Figure 4: The program running in real time, performing sign language gesture detection with the confidence level shown.