GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents

Tenglong Ao, Zeyi Zhang, Libin Liu

TL;DR

GestureDiffuCLIP introduces a diffusion-based generator for co-speech gestures guided by CLIP latents to allow flexible, multimodal style prompts (text, motion, or video). It couples a gesture–transcript semantic embedding learned via contrastive learning with a CLIP-guided AdaIN style injector to achieve semantically coherent, stylistically diverse gestures conditioned on speech. The system demonstrates state-of-the-art performance on motion quality, semantic-content matching, and style control across BEAT and ZeroEGGS, with robust zero-shot style generalization and real-time-feasible inference. This work enables richly controllable, stylized gesture synthesis for animated avatars and storytelling applications, and suggests scalable pathways for cross-modal generation with CLIP latent guidance.

Abstract

The automatic generation of stylized co-speech gestures has recently received increasing attention. Previous systems typically allow style control via predefined text labels or example motion clips, which are often not flexible enough to convey user intent accurately. In this work, we present GestureDiffuCLIP, a neural network framework for synthesizing realistic, stylized co-speech gestures with flexible style control. We leverage the power of the large-scale Contrastive Language-Image Pre-training (CLIP) model and present a novel CLIP-guided mechanism that extracts efficient style representations from multiple input modalities, such as a piece of text, an example motion clip, or a video. Our system learns a latent diffusion model to generate high-quality gestures and infuses the CLIP representations of style into the generator via an adaptive instance normalization (AdaIN) layer. We further devise a gesture-transcript alignment mechanism, based on contrastive learning, that ensures semantically correct gesture generation. Our system can also be extended to allow fine-grained style control of individual body parts. We demonstrate an extensive set of examples showing the flexibility and generalizability of our model to a variety of style descriptions. In a user study, we show that our system outperforms the state-of-the-art approaches regarding human likeness, appropriateness, and style correctness.
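
To make the AdaIN style injection mentioned in the abstract concrete, here is a minimal sketch in plain Python. It assumes the hidden features are a list of frames with per-channel values, and that a style embedding is mapped to per-channel scale and shift values by a single linear layer. The function names `adain` and `style_to_affine`, the shapes, and the linear mapping are illustrative assumptions, not the authors' implementation.

```python
import math

def adain(features, scale, shift, eps=1e-5):
    """Adaptive instance normalization over time.

    features: list of T frames, each a list of C channel values.
    scale, shift: per-channel lists of length C (assumed here to be
    predicted from the style embedding).
    """
    num_frames = len(features)
    num_channels = len(features[0])
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for c in range(num_channels):
        column = [frame[c] for frame in features]
        mean = sum(column) / num_frames
        var = sum((v - mean) ** 2 for v in column) / num_frames
        inv_std = 1.0 / math.sqrt(var + eps)
        # Normalize each channel, then re-style it with scale and shift.
        for t in range(num_frames):
            out[t][c] = scale[c] * (column[t] - mean) * inv_std + shift[c]
    return out

def style_to_affine(style_embedding, weight, bias):
    """Hypothetical linear layer mapping a style embedding of size D to
    2*C values, split into (scale, shift). weight has 2*C rows of size D."""
    projected = [
        sum(w * s for w, s in zip(row, style_embedding)) + b
        for row, b in zip(weight, bias)
    ]
    half = len(projected) // 2
    return projected[:half], projected[half:]

# Usage: two frames, two channels, a toy 2-D "style embedding".
scale, shift = style_to_affine(
    [1.0, 2.0],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]],
    [0.0, 0.0, 0.0, 0.5],
)
styled = adain([[1.0, 10.0], [3.0, 30.0]], scale, shift)
```

After this operation each channel's mean over time equals its shift and its standard deviation is approximately the magnitude of its scale, which is how the style embedding controls the feature statistics regardless of the content features' original range.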

Paper Structure

This paper contains 39 sections, 20 equations, 14 figures, 2 tables.

Figures (14)

  • Figure 1: Our system consists of two core components: (a) a latent diffusion model that takes speech audio and transcript as input and generates co-speech gestures, and (b) a CLIP-based encoder that extracts style embeddings from an arbitrary style prompt and incorporates them into the diffusion model via an adaptive instance normalization (AdaIN) layer. The system allows using short texts, video clips, and motion sequences to define gesture styles by encoding them into the same CLIP embedding space using corresponding pretrained encoders.
  • Figure 2: We learn a gesture-transcript joint embedding space using contrastive learning. A transcript encoder is trained to convert a transcript sentence $\bm{T}$ into a sequence of feature codes $\bm{Z}^t$, which are then aggregated into a transcript embedding vector $\bm{z}^t$ via max pooling. Similarly, the corresponding gesture sequence $\bm{\hat{Z}}$ is processed by a gesture encoder, resulting in a feature sequence $\bm{Z}^g$ and the corresponding embedding $\bm{z}^g$. The encoders are trained using a contrastive loss that maximizes the similarity between the embeddings $\bm{z}^t$ and $\bm{z}^g$ of paired transcripts and gestures.
  • Figure 3: An illustration of the CLIP-style contrastive loss used to train the gesture and transcript encoders.
  • Figure 4: Applications of the gesture-transcript joint embeddings. (a) Motion-based transcript retrieval. (b) Semantic saliency identification.
  • Figure 5: Architecture of the denoising network. The model is a multi-layer transformer with a causal attention structure. It takes the audio and transcript of a speech, along with a style prompt, as input and estimates the diffusion noise. Three CLIP-based encoders are learned to support different types of style prompts. The multimodal features are integrated into the network at various stages through semantics-aware layers and AdaIN layers, respectively. Norm refers to the layer normalization and FFN is the feed-forward network.
  • ...and 9 more figures
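
The max pooling and contrastive training described in the captions of Figures 2 and 3 can be sketched as follows. This is a minimal illustration in plain Python, assuming a symmetric cross-entropy over a cosine-similarity matrix in the style of CLIP. The function names, the L2 normalization step, and the temperature of 0.07 are assumptions chosen for illustration, and the exact loss used in the paper may differ.

```python
import math

def max_pool(sequence):
    """Aggregate a feature sequence (list of T vectors) into one vector
    by taking the per-dimension maximum over time."""
    return [max(column) for column in zip(*sequence)]

def l2_normalize(vector, eps=1e-12):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_style_loss(transcript_embs, gesture_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.
    Pair i is (transcript_embs[i], gesture_embs[i]); the temperature
    value is an assumed hyperparameter."""
    t = [l2_normalize(v) for v in transcript_embs]
    g = [l2_normalize(v) for v in gesture_embs]
    logits = [
        [sum(a * b for a, b in zip(ti, gj)) / temperature for gj in g]
        for ti in t
    ]
    logits_transposed = [list(column) for column in zip(*logits)]
    # Average the transcript-to-gesture and gesture-to-transcript terms.
    return 0.5 * (
        cross_entropy_diagonal(logits)
        + cross_entropy_diagonal(logits_transposed)
    )

# Usage: pool two toy feature sequences, then score a batch of two pairs.
z_t = [max_pool([[1.0, 0.0], [0.5, 0.1]]), max_pool([[0.0, 1.0], [0.2, 0.3]])]
z_g = [max_pool([[0.9, 0.0], [1.0, 0.1]]), max_pool([[0.1, 0.8], [0.0, 1.0]])]
loss = clip_style_loss(z_t, z_g)
```

Minimizing this loss pushes the embeddings of paired transcripts and gestures together and those of unpaired ones apart, which is the behavior the Figure 2 caption describes.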