Enhancing Spoken Discourse Modeling in Language Models Using Gestural Cues

Varsha Suresh, M. Hamza Mughal, Christian Theobalt, Vera Demberg

TL;DR

This work addresses the challenge of modeling spoken discourse by integrating co-speech gestures into language models. It introduces a gesture-tokenization pipeline based on VQ-VAE to discretize 3D motion into gesture tokens and a feature-alignment stage that maps these tokens into the language-embedding space, followed by LoRA-based fine-tuning on three linguistically grounded text infilling tasks (discourse connectives, quantifiers, stance markers). The results on the BEAT2 dataset show that gesture-augmented models consistently improve marker prediction accuracy and F1 scores, especially for rare markers, demonstrating that non-verbal cues provide complementary information for spoken-language modeling. This work lays groundwork for multimodal spoken discourse modeling and suggests future work on richer gesture data and broader conversational contexts.
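To make the text infilling setup concrete, here is a minimal sketch of how one such example could be built from a transcript utterance. The marker list, the `[MASK]` placeholder, and the helper name `make_infilling_example` are illustrative assumptions, not the paper's actual marker inventory or preprocessing.

```python
import re

# Illustrative marker inventory only (assumption); the paper's actual
# marker lists for each task are not reproduced here.
CONNECTIVES = ("because", "but", "so")


def make_infilling_example(utterance, markers=CONNECTIVES, mask="[MASK]"):
    """Blank out the leftmost marker and return (masked_text, label), or None."""
    pattern = r"\b(" + "|".join(map(re.escape, markers)) + r")\b"
    match = re.search(pattern, utterance, flags=re.IGNORECASE)
    if match is None:
        return None  # no marker in this utterance, so no example is produced
    masked = utterance[:match.start()] + mask + utterance[match.end():]
    return masked, match.group(1).lower()


example = make_infilling_example("I stayed home because it was raining")
print(example)
```

In this sketch the model would be asked to recover the label from the masked text; in the gesture-augmented setting described above, the gesture tokens for the same time span would accompany the masked text as additional input.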

Abstract

Research in linguistics shows that non-verbal cues, such as gestures, play a crucial role in spoken discourse. For example, speakers perform hand gestures to indicate topic shifts, helping listeners identify transitions in discourse. In this work, we investigate whether the joint modeling of gestures using human motion sequences and language can improve spoken discourse modeling in language models. To integrate gestures into language models, we first encode 3D human motion sequences into discrete gesture tokens using a VQ-VAE. These gesture token embeddings are then aligned with text embeddings through feature alignment, mapping them into the text embedding space. To evaluate the gesture-aligned language model on spoken discourse, we construct text infilling tasks targeting three key discourse cues grounded in linguistic research: discourse connectives, stance markers, and quantifiers. Results show that incorporating gestures enhances marker prediction accuracy across the three tasks, highlighting the complementary information that gestures can offer in modeling spoken discourse. We view this work as an initial step toward leveraging non-verbal cues to advance spoken language modeling in language models.
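The two integration steps described in the abstract can be illustrated with a toy NumPy sketch: a nearest-codebook lookup (the quantization step of a VQ-VAE) followed by a projection of gesture-token embeddings to the text embedding size. The codebook, the dimensions, and the use of a single linear map for alignment are assumptions chosen for illustration; the abstract does not specify the form of the alignment module.

```python
import numpy as np


def quantize(latents, codebook):
    """Return, for each latent vector, the index of its nearest codebook entry."""
    # Squared Euclidean distance between every latent (T, D) and every code (K, D).
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def project_tokens(token_ids, gesture_embeddings, projection):
    """Look up gesture-token embeddings and map them to the text embedding size."""
    # A single linear map is an assumption made for this sketch.
    return gesture_embeddings[token_ids] @ projection


# Toy codebook with K=4 codes of dimension D=4 (hypothetical sizes).
codebook = np.eye(4)
# Stand-ins for encoder outputs of three motion frames.
latents = np.array([
    [0.1, 0.0, 0.0, 0.9],
    [0.0, 0.2, 0.0, 0.8],
    [0.1, 0.7, 0.0, 0.1],
])
token_ids = quantize(latents, codebook)

rng = np.random.default_rng(0)
gesture_embeddings = rng.normal(size=(4, 4))   # one embedding per gesture token
projection = rng.normal(size=(4, 16))          # toy text embedding size of 16
aligned = project_tokens(token_ids, gesture_embeddings, projection)
print(token_ids.tolist(), aligned.shape)
```

Each row of `aligned` has the same width as a text token embedding, so it could in principle be placed in the language model's input sequence alongside text embeddings.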

Paper Structure

This paper contains 28 sections, 2 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overall framework of our approach for integrating gestures into language models.
  • Figure 2: Frequency distribution of markers across three tasks: Discourse Connectives, Quantifiers, and Stance Markers in the BEAT2 training data.
  • Figure 3: Relative confusion matrices comparing GestureLM and the Text-only baseline. The matrices highlight differences in class-wise predictions, with red indicating more predictions by the Text-only model and blue indicating more by GestureLM.
  • Figure 4: Samples where GestureLM performs better than the Text-only model. The semantic gestures co-occur with the spoken discourse markers, potentially leading to improved prediction performance. See the text for a detailed description; more examples are shown in the Appendix.
  • Figure 5: Examples
  • ...and 1 more figure