Gesture-Aware Zero-Shot Speech Recognition for Patients with Language Disorders

Seungbae Kim, Daeun Lee, Brielle Stark, Jinyoung Han

TL;DR

This work tackles the difficulty of speech-based communication for people with language disorders by integrating iconic gestures into speech recognition. It introduces a gesture-aware zero-shot framework that fuses audio with gestures via a multimodal LLM, enabling contextual rewriting of transcripts. Experiments on AphasiaBank using the Peanut Butter Sandwich Task show that gesture information improves semantic interpretation beyond audio alone, with Whisper as a strong baseline whose WER improves with confidence filtering. The approach holds promise for more inclusive assistive technologies and language-therapy tools that capture speaker intent more accurately.

Abstract

Individuals with language disorders often face significant communication challenges due to their limited language processing and comprehension abilities, which also affect their interactions with voice-assisted systems that mostly rely on Automatic Speech Recognition (ASR). Despite advancements in ASR that address disfluencies, little attention has been paid to integrating non-verbal communication methods, such as gestures, on which individuals with language disorders rely substantially to supplement their communication. Recognizing the need to interpret the latent meanings of visual information not captured by speech alone, we propose a gesture-aware ASR system utilizing a multimodal large language model with zero-shot learning for individuals with speech impairments. Our experimental results and analyses show that including gesture information significantly enhances semantic understanding. This study can help develop effective communication technologies specifically designed to meet the unique needs of individuals with language impairments.

Paper Structure

This paper contains 17 sections, 3 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: The overall process of the proposed system. Our model integrates incomplete speech and visual data (i.e., iconic gestures) and generates semantically enriched transcripts.