CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation

Fengyi Fang, Sicheng Yang, Wenming Yang

TL;DR

CoordSpeaker tackles semantic grounding and coordination in co-speech gesture generation by introducing a gesture captioning framework that produces multi-granular captions, paired with a unified latent diffusion-based gesture generator guided by hierarchical conditioning. The approach generates semantically coherent non-spontaneous gestures and rhythmically synchronized spontaneous gestures at the same time, achieving high-quality results with improved efficiency. Key contributions include the gesture captioning module, a unified cross-dataset motion representation, and a hierarchically controlled denoiser for multimodal control, validated by qualitative, quantitative, and perceptual evaluations. The work advances bidirectional gesture-text mapping and offers practical benefits for avatar realism, public speaking simulations, and interactive agents.

Abstract

Co-speech gesture generation has significantly advanced human-computer interaction, yet speaker movements remain constrained because text-driven non-spontaneous gestures (e.g., bowing while talking) are omitted. Existing methods face two key challenges: 1) the semantic prior gap caused by the lack of descriptive text annotations in gesture datasets, and 2) the difficulty of achieving coordinated multimodal control over gesture generation. To address these challenges, this paper introduces CoordSpeaker, a comprehensive framework that enables coordinated caption-empowered co-speech gesture synthesis. Our approach first bridges the semantic prior gap through a novel gesture captioning framework, leveraging a motion-language model to generate descriptive captions at multiple granularities. Building upon this, we propose a conditional latent diffusion model with a unified cross-dataset motion representation and a hierarchically controlled denoiser to achieve highly controlled, coordinated gesture generation. CoordSpeaker is the first to explore gesture understanding and captioning as a way to tackle the semantic gap in gesture generation, while offering a novel perspective of bidirectional gesture-text mapping. Extensive experiments demonstrate that our method produces high-quality gestures that are both rhythmically synchronized with speech and semantically coherent with arbitrary captions, achieving superior performance with higher efficiency compared to existing approaches.

Paper Structure

This paper contains 53 sections, 6 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: CoordSpeaker exploits gesture captioning to enable customized coordinated speaker gesture generation, producing both co-speech spontaneous gestures and caption-driven non-spontaneous motions. In a speech scenario, our method allows the speaker to naturally walk forward and bow while speaking, seamlessly delivering a closing gesture.
  • Figure 2: Overview of CoordSpeaker. We first introduce Gesture Captioning (Sec. \ref{sec:method:caption}) to bridge the semantic prior gap of gesture data, generating descriptive, multi-granular gesture captions at low cost. Subsequently, we propose a Coordinated Gesture Generation Model (Sec. \ref{sec:method:generation}) enabling harmonious coordination over heterogeneous multi-modal and multi-scale conditions. Our model can generate both rhythmically synchronous and semantically coherent gestures with high quality and superior efficiency.
  • Figure 3: Model overview. (Top) Gesture Captioning Framework: A motion tokenizer and a motion-aware language model (MotionLLM) generate descriptive gesture [captions] from predefined [prompt] and [motion] inputs, addressing the semantic prior gap efficiently. A multi-granular captioning mechanism further enhances multi-scale semantic alignment via three strategies: Regular, Dynamic, and Hierarchical. (Bottom) Coordinated Gesture Generation Model: A gesture VAE first learns a unified latent motion space for cross-dataset modeling. A conditional latent diffusion model with a hierarchically controlled denoiser enables efficient and coordinated gesture generation via hierarchical multimodal condition injection.
  • Figure 4: Qualitative comparison of coordinated gesture generation. Red boxes highlight semantic inconsistencies, yellow boxes indicate unnatural motions, and green boxes denote well-coordinated natural gestures. More results are in the supplementary material (Sec. \ref{sec:appendix_visual}).
  • Figure 5: Visualization results. (a) Gesture captioning examples. Our captioning framework can effectively describe both overall motion patterns and fine-grained details. More results are in the supplementary material (Sec. \ref{sec:appendix_visual}). (b) Quantitative captioning evaluation. Our model performs comparably to human annotations. (c) Qualitative ablation study. Results are generated using audio and a single caption.
  • ...and 3 more figures