IsoSignVid2Aud: Sign Language Video to Audio Conversion without Text Intermediaries

Harsh Kavediya, Vighnesh Nayak, Bheeshm Sharma, Balamurugan Palaniappan

TL;DR

This work tackles the problem of translating sign language videos directly into speech without intermediate text representations, focusing on isolated signs for practical real-time communication. It introduces IsoSignVid2Aud, an end-to-end pipeline that combines an I3D-based visual feature extractor, a spectrogram generator, and ISTFT-based audio synthesis, augmented by a Non-Maximal Suppression module for temporal sign isolation. The approach achieves competitive sign recognition accuracy and intelligible audio on ASL-Citizen-1500 and WLASL-100, with metrics such as Top-1 accuracy and PESQ/STOI demonstrating practical viability. By avoiding text intermediaries, the model reduces error cascades and latency, offering a modular foundation for future enhancements and cross-language sign-to-speech deployments.
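
The ISTFT-based synthesis step can be illustrated with a short round-trip sketch. It is a minimal illustration, not the authors' implementation: it assumes a complex spectrogram (magnitude and phase) is available, uses a naive DFT instead of an FFT to stay dependency-free, and the names `hann`, `stft`, `istft`, `n_fft`, and `hop` are hypothetical. How IsoSignVid2Aud obtains phase for its generated spectrograms is not stated in this summary.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft(signal, n_fft, hop):
    """Windowed naive DFT of each frame; returns a list of complex spectra."""
    win = hann(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        seg = [signal[start + t] * win[t] for t in range(n_fft)]
        spec = [
            sum(seg[t] * cmath.exp(-2j * math.pi * k * t / n_fft) for t in range(n_fft))
            for k in range(n_fft)
        ]
        frames.append(spec)
    return frames


def istft(frames, n_fft, hop):
    """Inverse DFT per frame, then windowed overlap-add with window-squared normalization."""
    win = hann(n_fft)
    length = n_fft + hop * (len(frames) - 1)
    out = [0.0] * length
    norm = [0.0] * length
    for f, spec in enumerate(frames):
        start = f * hop
        for t in range(n_fft):
            x = sum(
                spec[k] * cmath.exp(2j * math.pi * k * t / n_fft) for k in range(n_fft)
            ).real / n_fft
            out[start + t] += x * win[t]
            norm[start + t] += win[t] ** 2
    # Samples where the summed window energy is ~0 cannot be recovered; emit silence there.
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]


# Usage: analyze and resynthesize a short two-tone test signal.
signal = [
    math.sin(2 * math.pi * 3 * i / 64) + 0.5 * math.cos(2 * math.pi * 7 * i / 64)
    for i in range(64)
]
reconstructed = istft(stft(signal, n_fft=16, hop=4), n_fft=16, hop=4)
```

In this sketch the very first sample is not recoverable because the Hann window is zero there; padding the signal before analysis is the usual workaround.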

Abstract

Translating sign language into spoken-language audio is important for connecting hearing- and speech-challenged people with others. We consider sign language videos that contain sequences of isolated signs rather than continuous grammatical signing. Such videos are useful in educational applications and sign prompt interfaces. To this end, we propose IsoSignVid2Aud, a novel end-to-end framework that translates sign language videos containing a sequence of possibly non-grammatical signs into speech without an intermediate text representation. This provides immediate communication benefits while avoiding the latency and cascading errors inherent in multi-stage translation systems. Our approach combines an I3D-based feature extraction module with a specialized feature transformation network and an audio generation pipeline, and uses a novel Non-Maximal Suppression (NMS) algorithm to detect signs temporally in non-grammatical continuous sequences. Experimental results demonstrate competitive performance on the ASL-Citizen-1500 and WLASL-100 datasets, with Top-1 accuracies of 72.01% and 78.67%, respectively, and audio quality metrics (PESQ: 2.67, STOI: 0.73) indicating intelligible speech output. Code is available at: https://github.com/BheeshmSharma/IsoSignVid2Aud_AIMLsystems-2025.
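
To make the idea of temporal sign detection concrete, the sketch below shows standard greedy temporal non-maximum suppression over scored sliding windows. It is an assumption-laden illustration only: the abstract does not describe the details of the authors' novel NMS algorithm, and the segment format `(start_frame, end_frame, score)`, the function names, and the `iou_threshold` value are hypothetical.

```python
from typing import List, Tuple

# Hypothetical segment format: (start_frame, end_frame, confidence score)
Segment = Tuple[int, int, float]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two frame intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(segments: List[Segment], iou_threshold: float = 0.3) -> List[Segment]:
    """Greedy NMS: keep the highest-scoring windows, drop those overlapping a kept one."""
    kept: List[Segment] = []
    for seg in sorted(segments, key=lambda s: s[2], reverse=True):
        if all(temporal_iou(seg, k) <= iou_threshold for k in kept):
            kept.append(seg)
    # Return the surviving windows in temporal order, one per detected sign.
    return sorted(kept, key=lambda s: s[0])


# Usage: four overlapping candidate windows from a sliding-window classifier.
windows = [(0, 30, 0.9), (10, 40, 0.6), (35, 65, 0.8), (60, 90, 0.4)]
signs = temporal_nms(windows)
```

Here the window `(10, 40)` is suppressed because it overlaps the higher-scoring `(0, 30)` window with an IoU of 0.5, while the other three windows survive as separate sign candidates.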

Paper Structure

This paper contains 30 sections, 7 equations, 1 figure, 3 tables, 1 algorithm.

Figures (1)

  • Figure 1: Proposed architecture of IsoSignVid2Aud (video frames are from the ASL-Citizen dataset [desai2023asl])