Speech-driven Personalized Gesture Synthetics: Harnessing Automatic Fuzzy Feature Inference

Fan Zhang, Zhaohan Wang, Xin Lyu, Siyuan Zhao, Mengjian Li, Weidong Geng, Naye Ji, Hui Du, Fuxing Gao, Hao Wu, Shunman Li

TL;DR

Persona-Gestor addresses the challenge of synthesizing personalized co-speech gestures from raw speech by introducing a fuzzy feature inference engine and an AdaLN transformer within a diffusion framework. The model eliminates the need for explicit style labels, producing high-fidelity, full-body gestures that synchronize with speech while preserving naturalness. Rigorous evaluations on Trinity, ZEGGS, and BEAT demonstrate state-of-the-art Fréchet Gesture Distances and strong subjective quality, alongside robust generalization to in-the-wild audio. This work advances practical, user-friendly virtual humans by enabling speaker-aware gesture synthesis from acoustic input alone, with potential impact on animation, virtual assistants, and HCI applications.

Abstract

Speech-driven gesture generation is an emerging field within virtual human creation. However, a significant challenge lies in accurately determining and processing the multitude of input features (such as acoustic, semantic, emotional, personality, and even subtle unknown features). Traditional approaches, which rely on various explicit feature inputs and complex multimodal processing, constrain the expressiveness of the resulting gestures and limit their applicability. To address these challenges, we present Persona-Gestor, a novel end-to-end generative model designed to generate highly personalized 3D full-body gestures relying solely on raw speech audio. The model combines a fuzzy feature extractor with a non-autoregressive Adaptive Layer Normalization (AdaLN) transformer diffusion architecture. The fuzzy feature extractor uses a fuzzy inference strategy that automatically infers implicit, continuous fuzzy features. These fuzzy features, represented as a unified latent feature, are fed into the AdaLN transformer. The AdaLN transformer introduces a conditional mechanism that applies a uniform function across all tokens, thereby effectively modeling the correlation between the fuzzy features and the gesture sequence. This module ensures a high level of gesture-speech synchronization while preserving naturalness. Finally, we employ a diffusion model for training and inference, generating varied gestures. Extensive subjective and objective evaluations on the Trinity, ZEGGS, and BEAT datasets confirm that our model outperforms current state-of-the-art approaches. Persona-Gestor improves the system's usability and generalization capabilities, setting a new benchmark in speech-driven gesture synthesis and broadening the horizon for virtual human technology. Supplementary videos and code can be accessed at https://zf223669.github.io/Diffmotion-v2-website/
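
To make the "uniform function across all tokens" concrete, the following is a minimal sketch of adaptive layer normalization in the style commonly used in diffusion transformers, not the authors' implementation. It assumes a single conditioning vector (standing in for the unified latent fuzzy feature) from which one scale and one shift vector are regressed and applied identically to every gesture token. The function names, the weight layout, and the `x * (1 + scale) + shift` form are all illustrative assumptions; the block in the paper may differ.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, bias, c):
    """Dense layer; `weights` holds one row per output dimension."""
    return [sum(w * ci for w, ci in zip(row, c)) + b for row, b in zip(weights, bias)]


def adaln(tokens, cond, w_scale, b_scale, w_shift, b_shift):
    """Modulate every token with the same scale and shift regressed from `cond`.

    Assumption: `cond` is one vector shared by the whole sequence.
    """
    scale = linear(w_scale, b_scale, cond)
    shift = linear(w_shift, b_shift, cond)
    return [
        [n * (1.0 + s) + t for n, s, t in zip(layer_norm(tok), scale, shift)]
        for tok in tokens
    ]


# Hypothetical toy inputs: two gesture tokens of width 3, condition of width 2.
tokens = [[1.0, 2.0, 3.0], [4.0, 4.5, 8.0]]
cond = [0.5, -1.0]
w_scale = [[0.2, 0.0], [0.0, 0.1], [0.3, 0.3]]
w_shift = [[0.0, 0.5], [0.1, 0.0], [0.0, 0.0]]
zero_bias = [0.0, 0.0, 0.0]
modulated = adaln(tokens, cond, w_scale, zero_bias, w_shift, zero_bias)
```

With all modulation weights and biases set to zero, the sketch reduces to plain layer normalization, which is why zero initialization of these projections is a common choice in such designs: the conditioning then starts as a no-op and is learned gradually.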

Paper Structure (27 sections, 5 equations, 11 figures, 1 table, 2 algorithms)

Figures (11)

  • Figure 1: Each pose depicted is a personalized gesture generated solely from raw speech audio. Persona-Gestor offers a versatile solution that bypasses complex multimodal processing, thereby enhancing user-friendliness.
  • Figure 2: The architecture of Persona-Gestor integrates a fuzzy feature extractor and an adaptive layer normalization (AdaLN) transformer diffusion architecture. The fuzzy feature extractor comprises a dual-component framework that captures both the fuzzy style features and the detail-oriented audio features. These features, as unified latent features, are subsequently fed into the AdaLN transformer to model their relationship with the accompanying gesture, facilitating the estimation of diffusion noise for the diffusion model. (a) Overall Schematic. (b) Fuzzy Feature Extractor. (c) AdaLN Transformer Block.
  • Figure 3: An overview of the fuzzy inference condition extractor.
  • Figure 4: Samples of gestures corresponding to different emotions. The left side of the subfigure displays ground truth gestures, while the right side showcases gestures generated by our architecture.
  • Figure 5: Samples of gestures corresponding to different personalities. The left side of the subfigure displays ground truth gestures, while the right side showcases gestures generated by our architecture.
  • ...and 6 more figures