Gesture-Aware Zero-Shot Speech Recognition for Patients with Language Disorders
Seungbae Kim, Daeun Lee, Brielle Stark, Jinyoung Han
TL;DR
This work tackles the difficulty of speech-based communication for people with language disorders by integrating iconic gestures into speech recognition. It introduces a gesture-aware zero-shot framework that fuses audio with gestures via a multimodal LLM, enabling contextual rewriting of transcripts. Experiments on AphasiaBank using the Peanut Butter Sandwich Task show that gesture information improves semantic interpretation beyond audio alone, with Whisper as a strong baseline whose WER improves with confidence filtering. The approach holds promise for more inclusive assistive technologies and language-therapy tools that capture speaker intent more accurately.
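To make the WER claim concrete, the following is a minimal sketch of how word error rate can be computed over a set of utterances, with an optional confidence threshold applied first. It is not the authors' evaluation code. The `Utterance` structure, the per-utterance `confidence` score, the `min_confidence` parameter, and the toy transcripts are all assumptions introduced for illustration; the summary does not specify how confidence filtering was performed.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    reference: str     # human transcript
    hypothesis: str    # ASR output
    confidence: float  # hypothetical per-utterance ASR confidence in [0, 1]


def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(utterances: list[Utterance], min_confidence: float = 0.0) -> float:
    """Corpus-level WER over utterances whose confidence meets the threshold."""
    kept = [u for u in utterances if u.confidence >= min_confidence]
    total_ref_words = sum(len(u.reference.split()) for u in kept)
    if total_ref_words == 0:
        raise ValueError("no reference words left after filtering")
    errors = sum(
        word_edit_distance(u.reference.split(), u.hypothesis.split())
        for u in kept
    )
    return errors / total_ref_words


# Toy data, made up for illustration only.
toy = [
    Utterance("spread the peanut butter", "spread the peanut butter", 0.9),
    Utterance("put the bread together", "put bed", 0.3),
]
wer_all = corpus_wer(toy)                           # all utterances
wer_filtered = corpus_wer(toy, min_confidence=0.5)  # high-confidence only
```

Filtering lowers WER here only because the low-confidence utterance is excluded from scoring, so a filtered WER should be reported together with the fraction of utterances retained.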
Abstract
Individuals with language disorders often face significant communication challenges due to their limited language processing and comprehension abilities, which also affect their interactions with voice-assisted systems that mostly rely on Automatic Speech Recognition (ASR). Despite advancements in ASR that address disfluencies, there has been little attention to integrating non-verbal communication methods, such as gestures, which individuals with language disorders rely on substantially to supplement their communication. Recognizing the need to interpret the latent meanings of visual information not captured by speech alone, we propose a gesture-aware ASR system utilizing a multimodal large language model with zero-shot learning for individuals with speech impairments. Our experimental results and analyses show that including gesture information significantly enhances semantic understanding. This study can help develop effective communication technologies specifically designed to meet the unique needs of individuals with language impairments.
