Towards Inclusive Communication: A Unified Framework for Generating Spoken Language from Sign, Lip, and Audio

Jeong Hun Yeo, Hyeongseop Rha, Sungjune Park, Junil Won, Yong Man Ro

TL;DR

The paper addresses inclusive communication with a unified tri-modal framework that accepts any combination of sign language, lip movements, and audio and generates spoken-language text for four tasks: Sign Language Translation (SLT), Visual Speech Recognition (VSR), Automatic Speech Recognition (ASR), and Audio-Visual Speech Recognition (AVSR). Each modality has its own encoder (sign: Video Swin Transformer; lip: AV-HuBERT; audio: Whisper). Length adapters align the encoder outputs in time, a shared mapping converts them into linguistic tokens, and an LLM decoder consumes those tokens. Training follows a two-stage, task-adaptive multi-task strategy. Two findings stand out. First, explicitly modeling lip movements as a separate modality significantly improves SLT performance. Second, a single unified model matches or exceeds task-specific models on all four tasks and remains robust to noise, especially in AVSR. The framework therefore supports flexible cross-modal language generation and offers analytical insight into how each modality contributes, in particular the non-manual cues conveyed by lip movements.
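To make the align-then-fuse idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The summary above does not specify how the length adapters or the shared mapping work, so this sketch assumes average pooling for the adapter and feature-wise concatenation for fusion. The names `length_adapter` and `fuse`, the pooling factor, and the feature sizes are all hypothetical.

```python
from typing import List

Vector = List[float]


def length_adapter(frames: List[Vector], factor: int) -> List[Vector]:
    """Shorten a feature sequence by average-pooling groups of `factor` frames.

    Assumption: average pooling stands in for the paper's length adapter,
    whose actual design is not described in this summary.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    pooled = []
    for start in range(0, len(frames), factor):
        group = frames[start:start + factor]  # last group may be shorter
        dim = len(group[0])
        pooled.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return pooled


def fuse(streams: List[List[Vector]]) -> List[Vector]:
    """Concatenate time-aligned streams feature-wise.

    Assumption: concatenation stands in for the shared mapping to
    linguistic tokens. Streams are truncated to the shortest one.
    """
    steps = min(len(s) for s in streams)
    return [[x for s in streams for x in s[t]] for t in range(steps)]


# Hypothetical toy features: 8 audio frames (dim 2) and 4 lip frames (dim 3).
audio = [[float(i), float(i) + 1.0] for i in range(8)]
lip = [[0.1 * i, 0.2 * i, 0.3 * i] for i in range(4)]

audio_aligned = length_adapter(audio, factor=2)  # 8 frames -> 4 frames
tokens = fuse([audio_aligned, lip])              # 4 steps, 2 + 3 features each
```

In a real system the fused vectors would then pass through a learned projection into the LLM's embedding space. That step is omitted here because it depends on details the summary does not give.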

Abstract

Audio is the primary modality for human communication and has driven the success of Automatic Speech Recognition (ASR) technologies. However, such audio-centric systems inherently exclude individuals who are deaf or hard of hearing. Visual alternatives such as sign language and lip reading offer effective substitutes, and recent advances in Sign Language Translation (SLT) and Visual Speech Recognition (VSR) have improved audio-less communication. Yet, these modalities have largely been studied in isolation, and their integration within a unified framework remains underexplored. In this paper, we propose the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or superior to state-of-the-art models specialized for individual tasks. Building on this framework, we achieve performance on par with or better than task-specific state-of-the-art models across SLT, VSR, ASR, and Audio-Visual Speech Recognition. Furthermore, our analysis reveals a key linguistic insight: explicitly modeling lip movements as a distinct modality significantly improves SLT performance by capturing critical non-manual cues.

Paper Structure

This paper contains 33 sections, 7 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Overview of our proposed method. Multimodal inputs (sign, lip, audio) are temporally aligned, fused into linguistic tokens, and processed by an LLM to generate language outputs guided by task-adaptive instructions. Note that the actual input modalities vary depending on the task type.
  • Figure 2: Multi-task training progress for AVSR, ASR, VSR, and SLT. Curves show next-token prediction accuracy per epoch. Audio-related tasks are plotted in blue; visual-related tasks in red.
  • Figure 3: Visualization of attention distributions and qualitative comparisons of SLT predictions with and without lip features. From top to bottom: (1) signer frames, (2) corresponding lip-region crops, (3) attention scores averaged over all heads in the last LLM layer, and (4) ground-truth and predicted sentences. Examples on the left and right correspond to the text tokens “web” and “temperament,” respectively.
  • Figure 4: WER comparison of ASR, AVSR, and VSR under varying SNR levels (-5 to 10 dB) using babble noise on the LRS3 dataset. All results are obtained from a single unified model evaluated across the three tasks.
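
The evaluation behind Figure 4 rests on two standard quantities: word error rate (WER) and signal-to-noise ratio (SNR). The sketch below shows how each is commonly computed. It is an illustration only, not the authors' evaluation code. The function names `wer` and `mix_at_snr` and the toy waveforms are hypothetical, and real evaluations usually normalize text (case, punctuation) before scoring.

```python
import math
from typing import List, Sequence


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def mix_at_snr(speech: Sequence[float], noise: Sequence[float],
               snr_db: float) -> List[float]:
    """Add noise to speech, scaled so the mixture has the requested SNR in dB."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    if p_noise == 0:
        raise ValueError("noise has zero power")
    # SNR_dB = 10 * log10(p_speech / (scale^2 * p_noise)), solved for scale
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


# Toy usage with made-up samples; real inputs would be audio waveforms.
speech = [1.0, -1.0, 1.0, -1.0]
babble = [0.5, 0.5, -0.5, -0.5]
noisy = mix_at_snr(speech, babble, snr_db=-5.0)
score = wer("the cat sat on the mat", "the cat sit on mat")
```

Sweeping `snr_db` over the range in Figure 4 (-5 to 10 dB) and computing WER on each model output would give one curve per task.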