Table of Contents
Fetching ...

Gesture2Speech: How Far Can Hand Movements Shape Expressive Speech?

Lokesh Kumar, Nirmesh Shah, Ashishkumar P. Gudmalwar, Pankaj Wasnik

Abstract

Human communication seamlessly integrates speech and bodily motion, where hand gestures naturally complement vocal prosody to express intent, emotion, and emphasis. While recent text-to-speech (TTS) systems have begun incorporating multimodal cues such as facial expressions or lip movements, the role of hand gestures in shaping prosody remains largely underexplored. We propose a novel multimodal TTS framework, Gesture2Speech, that leverages visual gesture cues to modulate prosody in synthesized speech. Motivated by the observation that confident and expressive speakers coordinate gestures with vocal prosody, we introduce a multimodal Mixture-of-Experts (MoE) architecture that dynamically fuses linguistic content and gesture features within a dedicated style extraction module. The fused representation conditions an LLM-based speech decoder, enabling prosodic modulation that is temporally aligned with hand movements. We further design a gesture-speech alignment loss that explicitly models their temporal correspondence to ensure fine-grained synchrony between gestures and prosodic contours. Evaluations on the PATS dataset show that Gesture2Speech outperforms state-of-the-art baselines in both speech naturalness and gesture-speech synchrony. To the best of our knowledge, this is the first work to utilize hand gesture cues for prosody control in neural speech synthesis. Demo samples are available at https://research.sri-media-analysis.com/aaai26-beeu-gesture2speech/

Gesture2Speech: How Far Can Hand Movements Shape Expressive Speech?

Abstract

Human communication seamlessly integrates speech and bodily motion, where hand gestures naturally complement vocal prosody to express intent, emotion, and emphasis. While recent text-to-speech (TTS) systems have begun incorporating multimodal cues such as facial expressions or lip movements, the role of hand gestures in shaping prosody remains largely underexplored. We propose a novel multimodal TTS framework, Gesture2Speech, that leverages visual gesture cues to modulate prosody in synthesized speech. Motivated by the observation that confident and expressive speakers coordinate gestures with vocal prosody, we introduce a multimodal Mixture-of-Experts (MoE) architecture that dynamically fuses linguistic content and gesture features within a dedicated style extraction module. The fused representation conditions an LLM-based speech decoder, enabling prosodic modulation that is temporally aligned with hand movements. We further design a gesture-speech alignment loss that explicitly models their temporal correspondence to ensure fine-grained synchrony between gestures and prosodic contours. Evaluations on the PATS dataset show that Gesture2Speech outperforms state-of-the-art baselines in both speech naturalness and gesture-speech synchrony. To the best of our knowledge, this is the first work to utilize hand gesture cues for prosody control in neural speech synthesis. Demo samples are available at https://research.sri-media-analysis.com/aaai26-beeu-gesture2speech/
Paper Structure (19 sections, 9 equations, 3 figures, 4 tables)

This paper contains 19 sections, 9 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overview of (a) the proposed Gesture2Speech architecture and (b) MoE layer. The system takes text, speech, and video-based gesture features as input and generates expressive speech. Multiple MoE modules enable dynamic routing of features for improved style representation and are aligned with gestural intent via cross-attention.
  • Figure 2: Comparison of gesture apexes and speech prominence peaks between ground truth and TTS-generated speech. The Cross-Modal Temporal Distance (CMTD) quantifies alignment between gestures and prosodic peaks. Ground truth CMTD: 0.819 seconds, indicating looser temporal alignment, while TTS CMTD: 0.727 seconds suggests improved synchronization with gestures.
  • Figure 3: Speaker-wise t-SNE plots analysis of proposed style transfer network. The t-SNE plot of (a) Speech MoE embeddings, (b) video MoE embeddings and (c) Multimodal MoE embeddings.