HOP: Heterogeneous Topology-based Multimodal Entanglement for Co-Speech Gesture Generation

Hongye Cheng, Tianyu Wang, Guangsi Shi, Zexing Zhao, Yanwei Fu

TL;DR

HOP introduces a topology-based heterogeneous multimodal entanglement framework for co-speech gesture generation, jointly modeling text, audio, and action with audio serving as a rhythmic-semantic bridge. It combines a reprogramming-based audio-text adaptor, a spatiotemporal graph encoder for audio-action fusion, and a GAN-backed pose generator to produce coherent, diverse gestures. The approach achieves state-of-the-art results on TED Gesture and TED Expressive in FGD, BC, and diversity, and is validated by qualitative analyses and a user study showing improved naturalness and expressiveness. Overall, HOP advances multimodal co-speech gesture generation by explicitly encoding inter-modal topologies and cross-modality adaptations, enabling more natural human-avatar interactions.

Abstract

Co-speech gestures are crucial non-verbal cues that enhance speech clarity and expressiveness in human communication, and they have attracted increasing attention in multimodal research. While existing methods have made strides in gesture accuracy, challenges remain in generating diverse and coherent gestures, as most approaches assume independence among multimodal inputs and lack explicit modeling of their interactions. In this work, we propose HOP, a novel multimodal learning method for co-speech gesture generation that captures the heterogeneous entanglement between gesture motion, audio rhythm, and text semantics, enabling the generation of coordinated gestures. We align audio and action through spatiotemporal graph modeling. Moreover, to enhance modality coherence, we build the audio-text semantic representation with a reprogramming module, which benefits cross-modality adaptation. Our approach enables the three modalities to learn from each other's features and represents them in the form of topological entanglement. Extensive experiments demonstrate that HOP achieves state-of-the-art performance, offering more natural and expressive co-speech gesture generation. More information, code, and demos are available at https://star-uu-wang.github.io/HOP/

Paper Structure

This paper contains 15 sections, 9 equations, 8 figures, 7 tables, 1 algorithm.

Figures (8)

  • Figure 1: HOP: We propose a topology-based heterogeneous multimodal model that integrates features from audio, text, and action, accounting for their inherent heterogeneity through cross-modality adaptation. The model achieves superior performance on both the TED-Expressive dataset (first row) and the TED dataset (second row), generating gestures that align with the semantics and rhythmic qualities of the speech, as well as the motion characteristics of the real speaker.
  • Figure 2: Overview of the proposed framework for multimodal gesture generation with heterogeneous topology entanglement. Given the input speech text and the mel-spectrogram obtained through audio preprocessing, we treat audio sequences as a bridge, linking text sequences and action sequences with distinct topologies. For the connection between text and audio, we apply a reprogramming layer to align data from these different modalities, utilizing a language model to extract embedded semantic information. To link action and audio, we employ the Graph-WaveNet approach to separately extract action and audio features. The entangled multimodal representations are then fed into the gesture generator through topological fusion, resulting in the generation of co-speech gestures.
  • Figure 3: Heterogeneous entanglement of multimodal data. We use red, blue, and green shading to denote text data, audio data, and action data, respectively. While text and action exhibit significant heterogeneity, audio serves as a direct mediator between the two, establishing a path of connectivity that facilitates the full utilization of multimodal data for gesture generation.
  • Figure 4: A showcase of reprogramming in audio-text cross-modality adaptation. We visualize the features before and after reprogramming in the cross-modality adaptation, as well as the feature separation of audio and text before training. Before training, the audio features are noticeably noisier than the text features. After passing through the reprogramming layer, the correlation between audio and text increases as training progresses, showing a trend of alignment.
  • Figure 5: Visualization of generated gestures. The gestures generated by our method more effectively capture the semantic information in the text, exhibiting a greater range of movement rhythm in the highlighted sections. We highlight the text and its corresponding gesture actions using red and yellow shading, respectively.
  • ...and 3 more figures
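
The Figure 2 caption describes a reprogramming layer that aligns audio with text, but this summary does not say how that layer is built. The sketch below is one plausible reading, not the authors' implementation: it assumes the layer is a single-head cross-attention in which audio frames act as queries over a small set of text embeddings. The function `reprogram`, the projection matrices `w_q`, `w_k`, `w_v`, and all dimensions are hypothetical names and values chosen for illustration, and the inputs are random numbers rather than real features.

```python
import math
import random


def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    # Plain-Python matrix product: a is n x k, b is k x m.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def reprogram(audio, text_protos, w_q, w_k, w_v):
    # ASSUMPTION: reprogramming modeled as single-head cross-attention.
    # Audio frames are queries; text embeddings are keys and values.
    q = matmul(audio, w_q)            # T x d
    k = matmul(text_protos, w_k)      # P x d
    v = matmul(text_protos, w_v)      # P x d
    d = len(q[0])
    scores = matmul(q, transpose(k))  # T x P
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    # Each audio frame becomes a weighted mix of text-space vectors.
    return matmul(attn, v), attn


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: 4 audio frames, 3 text embeddings, shared width 2.
rng = random.Random(0)
T, P, d_audio, d_text, d = 4, 3, 5, 6, 2
audio = rand_matrix(T, d_audio, rng)
text_protos = rand_matrix(P, d_text, rng)
out, attn = reprogram(
    audio,
    text_protos,
    rand_matrix(d_audio, d, rng),
    rand_matrix(d_text, d, rng),
    rand_matrix(d_text, d, rng),
)
```

Under this assumption, `out` holds one text-space vector per audio frame, and each row of `attn` is a probability distribution over the text embeddings. That is the sense in which audio could be "reprogrammed" into a representation a language model can consume.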