HolisticSemGes: Semantic Grounding of Holistic Co-Speech Gesture Generation with Contrastive Flow-Matching

Lanmiao Liu, Esam Ghaleb, Aslı Özyürek, Zerrin Yumak

Abstract

While the field of co-speech gesture generation has seen significant advances, producing holistic, semantically grounded gestures remains a challenge. Existing approaches rely on external semantic retrieval methods, which limit their generalisation capability through their dependency on predefined linguistic rules. Flow-matching-based methods produce promising results; however, the network is optimised using only semantically congruent samples, without exposure to negative examples, which biases it towards learning rhythmic gestures rather than sparser, semantically meaningful motion such as iconic and metaphoric gestures. Furthermore, by modelling body parts in isolation, the majority of methods fail to maintain cross-modal consistency. We introduce a Contrastive Flow-Matching-based co-speech gesture generation model that uses mismatched audio-text conditions as negatives, training the velocity field to follow the correct motion trajectory while repelling semantically incongruent trajectories. Our model ensures cross-modal coherence by embedding text, audio, and holistic motion into a composite latent space via cosine and contrastive objectives. Extensive experiments and a user study demonstrate that our approach outperforms state-of-the-art methods on two datasets, BEAT2 and SHOW.
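To make the training objective concrete, the sketch below shows one plausible PyTorch form of a contrastive flow-matching loss: the predicted velocity is pulled towards the trajectory of the matched motion sample and pushed away from the trajectory of an in-batch mismatched one. The toy velocity network, the in-batch shuffling used to build negatives, and the repulsion weight lam are illustrative assumptions, not the paper's exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVelocityNet(nn.Module):
    # Stand-in for the velocity field v_theta(x_t, t, c); the real model
    # would be a transformer conditioned on fused audio-text features.
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + cond_dim + 1, 256),
                                 nn.SiLU(),
                                 nn.Linear(256, dim))

    def forward(self, xt, t, cond):
        B, T, _ = xt.shape
        t = t.expand(B, T, 1)                      # broadcast flow time over frames
        cond = cond.unsqueeze(1).expand(B, T, -1)  # broadcast condition over frames
        return self.net(torch.cat([xt, t, cond], dim=-1))

def contrastive_fm_loss(model, x0, x1, cond, lam=0.05):
    # x0: noise latents, x1: ground-truth motion latents, both (B, T, D);
    # cond: matched audio-text conditioning, (B, C). lam is an assumed weight.
    B = x0.size(0)
    t = torch.rand(B, 1, 1, device=x0.device)      # flow time in [0, 1]
    xt = (1 - t) * x0 + t * x1                     # point on the linear path
    u_pos = x1 - x0                                # velocity of the matched trajectory

    perm = torch.randperm(B, device=x0.device)     # shuffle => incongruent pairs
    u_neg = x1[perm] - x0                          # velocity towards a mismatched gesture

    v = model(xt, t, cond)                         # predicted velocity field
    # Attract to the congruent trajectory, repel from the incongruent one.
    return F.mse_loss(v, u_pos) - lam * F.mse_loss(v, u_neg)

# Minimal usage with random tensors standing in for motion latents and
# fused audio-text features.
model = ToyVelocityNet(dim=64, cond_dim=128)
x0, x1 = torch.randn(8, 32, 64), torch.randn(8, 32, 64)
loss = contrastive_fm_loss(model, x0, x1, torch.randn(8, 128))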

Figures (5)

  • Figure 1: Overall idea of our model. Audio and text transcripts are encoded into a shared semantic representation, where congruent (matching) and incongruent (mismatching) speech and gesture latents form contrastive pairs. Contrastive flow matching learns motion trajectories that align with the correct semantic context while diverging from mismatched samples, producing coherent full-body gestures.
  • Figure 2: Our architecture comprises two synergistic modules: (a) Semantics-Aware Composite Module (SACM), which anchors audio, textual, and motion features within an aligned semantic space to ensure cross-modal consistency; and (b) Multimodal Conditioning Module, which leverages a learned velocity field conditioned on multimodal priors and stochastic seed latents to synthesize expressive, kinematically coherent holistic gestures. (A minimal sketch of such an alignment objective appears after this figure list.)
  • Figure 3: Overview of the proposed semantic-aware contrastive flow matching framework for gesture generation. Audio and text transcripts are encoded into a shared semantic representation where congruent and incongruent gesture latents form contrastive pairs. Contrastive flow matching then learns motion trajectories aligned with the correct semantic context while diverging from mismatched inputs, producing coherent full-body gestures.
  • Figure 4: Qualitative comparison of semantic gesture generation on BEAT2. Our model synthesizes context-grounded gestures that precisely map to linguistic semantics: discourse marker “Well”, deictic pointing for “for me”, rhythmic denial for “never”, and iconic explanation for “because”. In contrast, baselines typically collapse into semantically neutral, rhythm-driven arm swings, failing to capture the distinct communicative functions of each clause.
  • Figure 5: User study evaluation results comparing Ground Truth, Our Model, SemTalk, and EMAGE across three criteria: Naturalness, Diversity, and Alignment with Speech Content and Timing. Error bars denote standard deviation across participants. Statistical significance is indicated by * ($p<0.05$) and ** ($p<0.01$).
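As referenced in the Figure 2 caption, the sketch below illustrates one plausible way the Semantics-Aware Composite Module could combine a cosine objective with a symmetric InfoNCE contrastive term to pull pooled text, audio, and motion embeddings of the same clip together. The pooling to per-clip vectors, the temperature, and the equal loss weighting are assumptions for illustration, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def semantic_alignment_loss(text_z, audio_z, motion_z, temperature=0.07):
    # Inputs: (B, D) pooled per-clip embeddings from the three encoders.
    text_z, audio_z, motion_z = (F.normalize(z, dim=-1)
                                 for z in (text_z, audio_z, motion_z))

    # Cosine objective: drive matched pairs towards cosine similarity 1.
    cosine = (1 - (text_z * motion_z).sum(-1)).mean() \
           + (1 - (audio_z * motion_z).sum(-1)).mean()

    def info_nce(a, b):
        # CLIP-style symmetric cross-entropy over the (B, B) similarity
        # matrix; the diagonal holds the congruent (matching) pairs.
        logits = a @ b.t() / temperature
        labels = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, labels)
                      + F.cross_entropy(logits.t(), labels))

    contrastive = info_nce(text_z, motion_z) + info_nce(audio_z, motion_z)
    return cosine + contrastive

# Example with random embeddings for a batch of 16 clips.
B, D = 16, 256
loss = semantic_alignment_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))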