Leveraging Speech for Gesture Detection in Multimodal Communication

Esam Ghaleb, Ilya Burenko, Marlou Rasenberg, Wim Pouw, Ivan Toni, Peter Uhrig, Anna Wilson, Judith Holler, Aslı Özyürek, Raquel Fernández

TL;DR

This work addresses co-speech gesture detection by integrating speech with visual skeletal data using a Transformer-based multimodal framework. It demonstrates that extended speech windows and cross-modal/early fusion outperform unimodal and late-fusion baselines, highlighting the predictive power of speech cues such as MFCC and F0 features for gesture onset. The approach combines ST-GCN-based vision embeddings with VGGish-derived speech embeddings, and evaluates three fusion strategies, achieving peak MAP improvements through ensembling of cross-modal and early fusion. The findings advance understanding of how speech and gestures co-occur in natural communication and offer practical methods for robust multimodal gesture detection in real-world settings.

Abstract

Gestures are inherent to human interaction and often complement speech in face-to-face communication, forming a multimodal communication system. An important task in gesture analysis is detecting a gesture's beginning and end. Research on automatic gesture detection has primarily focused on visual and kinematic information to detect a limited set of isolated or silent gestures with low variability, neglecting the integration of speech and vision signals to detect gestures that co-occur with speech. This work addresses this gap by focusing on co-speech gesture detection, emphasising the synchrony between speech and co-speech hand gestures. We address three main challenges: the variability of gesture forms, the temporal misalignment between gesture and speech onsets, and differences in sampling rate between modalities. We investigate extended speech time windows and employ separate backbone models for each modality to address the temporal misalignment and sampling rate differences. We utilize Transformer encoders in cross-modal and early fusion techniques to effectively align and integrate speech and skeletal sequences. The study results show that combining visual and speech information significantly enhances gesture detection performance. Our findings indicate that expanding the speech buffer beyond visual time segments improves performance and that multimodal integration using cross-modal and early fusion techniques outperforms baseline methods using unimodal and late fusion methods. Additionally, we find a correlation between the models' gesture prediction confidence and low-level speech frequency features potentially associated with gestures. Overall, the study provides a better understanding and detection methods for co-speech gestures, facilitating the analysis of multimodal communication.
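
The idea of a speech buffer that extends beyond the visual time segment, across modalities with different sampling rates, can be illustrated with a minimal sketch. The function name, the symmetric buffer on both sides, and the example values (25 fps video, 16 kHz audio, 0.5 s buffer) are assumptions for illustration only; they are not taken from the paper.

```python
def speech_sample_range(start_frame, num_frames, fps, sample_rate,
                        buffer_s, total_samples):
    """Map a visual window (in video frames) to an audio sample range.

    The audio range is widened by `buffer_s` seconds on each side
    (an assumed, symmetric buffer) and clipped to the recording bounds.
    """
    # Convert the visual window from frame indices to seconds.
    start_s = start_frame / fps
    end_s = (start_frame + num_frames) / fps

    # Extend the window in time, then convert seconds to audio samples.
    first = max(0, round((start_s - buffer_s) * sample_rate))
    last = min(total_samples, round((end_s + buffer_s) * sample_rate))
    return first, last
```

Under these assumed values, a one-second visual window starting at frame 50 spans 2.0 s to 3.0 s, and the buffered speech window spans 1.5 s to 3.5 s of audio. Working in seconds first keeps the two sampling rates independent of each other.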

Paper Structure

This paper contains 59 sections, 2 equations, 10 figures, and 2 tables.

Figures (10)

  • Figure 1: The figure illustrates the interaction between gestures and speech over two seconds. The speaker starts in a rest position and begins the gesture unit with a preparation phase. This is followed by the stroke phase, which is the meaningful part of the gesture unit. The gesture stroke is semantically related to the accompanying speech, i.e., "rod", sometimes referred to as the lexical affiliate. The speaker ends the gesture unit with a post-stroke hold. Typically, this is followed by a retraction phase, and then the speaker returns to the rest position again. We work with co-speech gestures that vary in form and duration based on the accompanying speech.
  • Figure 2: Distribution of the maximum of MFCC features when speech occurs with or without gestures. The number of voiced segments is controlled to have a similar distribution with or without gestures. In the first figure, the distribution of the maximum of MFCC[1] is much higher when a gesture accompanies speech.
  • Figure 3: Diagrams of the employed models' architectures. Our approach proceeds in four steps. (1) It starts by preparing speech and vision sequences (explained in Section \ref{sect:dataset}). (2) It then embeds the two modal sequences using modality-specific models (see Section \ref{sect:embedding_models}). (3) The framework employs three fusion strategies (late, early, and cross-modal) to create contextualized unimodal, bimodal, or cross-modal embeddings, respectively. Late fusion combines separate modality predictions, while early and cross-modal fusions integrate both streams. (4) It applies a classification step on the embeddings per time window in the sequence.
  • Figure 4: We trained models until convergence. Arguably, in the Cross-Attention scenario, F1-Score drastically increases in the middle of the training procedure after successfully combining inputs from both modalities.
  • Figure 5: Distributions of gesture predictions in early and cross-modal fusion approaches.
  • ...and 5 more figures
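
The three fusion strategies named in the Figure 3 caption can be contrasted with a small NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: attention here is single-head and has no learned projections, late fusion is taken to be a plain average of per-window probabilities, early fusion is taken to concatenate the two embeddings per time window along the feature axis, and cross-modal fusion is taken to use vision as queries over speech. All function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    """Single-head scaled dot-product attention (no learned weights)."""
    scores = query @ key.T / np.sqrt(query.shape[-1])
    return softmax(scores, axis=-1) @ value

def late_fusion(vision_probs, speech_probs):
    # Assumption: average the per-window gesture probabilities that
    # each modality predicted separately.
    return 0.5 * (vision_probs + speech_probs)

def early_fusion(vision_emb, speech_emb):
    # Assumption: concatenate both embeddings per time window, then
    # contextualize the bimodal sequence with self-attention.
    joint = np.concatenate([vision_emb, speech_emb], axis=-1)
    return attention(joint, joint, joint)

def cross_modal_fusion(vision_emb, speech_emb):
    # Assumption: vision windows act as queries over speech windows,
    # so each output row is a weighted mix of speech embeddings.
    return attention(vision_emb, speech_emb, speech_emb)
```

With inputs of shape (T, D) for T time windows, early fusion yields (T, 2D) bimodal embeddings and cross-modal fusion yields (T, D) cross-modal embeddings; a per-window classifier would then be applied to either output.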