A conversational gesture synthesis system based on emotions and semantics

Thanh Hoang-Minh

TL;DR

The paper introduces OHGesture, a diffusion-based gesture synthesis system that generates co-speech gestures conditioned on seed gestures, emotion, speech, and transcribed text. Building on DiffuseStyleGesture, it integrates fast text transcriptions and emotion-guided classifier-free diffusion, with a Transformer-based global fusion and Cross-Local Attention to align semantics and affect with gesture dynamics. Trained on the ZeroEGGS dataset, OHGesture produces gestures with strong human-likeness and contextual appropriateness, can interpolate between emotional states, and generalizes to out-of-distribution speech and synthetic voices. Rendering is demonstrated in Unity, enabling multimodal digital humans with enhanced expressivity and controllability for interactive applications.

Abstract

With the rapid progress of large language models, speech synthesis, hardware, and computer graphics, the current bottleneck in creating digital humans is generating character movements that correspond naturally to text or speech inputs. In this work, we present DeepGesture, a diffusion-based gesture synthesis framework for generating expressive co-speech gestures conditioned on multimodal signals: text, speech, emotion, and seed motion. Built upon the DiffuseStyleGesture model, DeepGesture introduces architectural enhancements that improve semantic alignment and emotional expressiveness in generated gestures. Specifically, we integrate fast text transcriptions as semantic conditioning and implement emotion-guided classifier-free diffusion to support controllable gesture generation across affective states. To visualize results, we implement a full rendering pipeline in Unity based on BVH output from the model. Evaluation on the ZeroEGGS dataset shows that DeepGesture produces gestures with improved human-likeness and contextual appropriateness. Our system supports interpolation between emotional states and generalizes to out-of-distribution speech, including synthetic voices, marking a step toward fully multimodal, emotionally aware digital humans.
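
The abstract names emotion-guided classifier-free diffusion and interpolation between emotional states but gives no formulas. The sketch below shows only the generic classifier-free guidance combination (unconditional output plus a scaled difference to the conditional output) together with a linear blend of two emotion vectors. All names here (`guided_prediction`, `toy_denoiser`, `lerp`, the two-dimensional emotion vectors, and the guidance scale) are hypothetical and are not taken from the paper or its code; the toy denoiser merely stands in for a trained network.

```python
from typing import Callable, List, Optional, Sequence

Vector = Sequence[float]
Denoiser = Callable[[Vector, Optional[Vector]], List[float]]


def lerp(a: Vector, b: Vector, w: float) -> List[float]:
    """Linear blend of two emotion vectors; w=0 gives a, w=1 gives b."""
    return [(1.0 - w) * x + w * y for x, y in zip(a, b)]


def guided_prediction(denoiser: Denoiser, noisy_gesture: Vector,
                      emotion: Vector, scale: float) -> List[float]:
    """Generic classifier-free guidance: uncond + scale * (cond - uncond)."""
    cond = denoiser(noisy_gesture, emotion)   # conditioned on the emotion
    uncond = denoiser(noisy_gesture, None)    # condition dropped
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def toy_denoiser(noisy_gesture: Vector, emotion: Optional[Vector]) -> List[float]:
    """Hypothetical stand-in for a trained network (assumption, not the paper's model)."""
    if emotion is None:
        return list(noisy_gesture)
    return [x + e for x, e in zip(noisy_gesture, emotion)]


happy = [1.0, 0.0]  # hypothetical emotion vectors
sad = [0.0, 1.0]
blended = lerp(happy, sad, 0.25)  # [0.75, 0.25]
out = guided_prediction(toy_denoiser, [0.0, 0.0], blended, scale=2.0)
print(out)  # [1.5, 0.5]
```

With `scale=1.0` the result equals the conditional output, and with `scale=0.0` it equals the unconditional one; larger scales push the output further toward the conditioned direction. Whether the paper blends emotions in exactly this way is not stated in the abstract.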

Paper Structure

This paper contains 54 sections, 17 equations, 12 figures, 3 tables, 2 algorithms.

Figures (12)

  • Figure 1: Skeleton and joint names of a single frame
  • Figure 2: A gesture sequence: the first $N$ frames are used as seed gesture $\mathbf{s}$, and the remaining $M$ frames are to be predicted
  • Figure 3: Common stages in gesture generation models.
  • Figure 4: Illustration of the diffusion drift term in gesture generation. The figure demonstrates how the learned drift guides the reverse diffusion process to synthesize temporally coherent and semantically relevant gestures from noise.
  • Figure 5: Overview of the OHGesture model
  • ...and 7 more figures