Freetalker: Controllable Speech and Text-Driven Gesture Generation Based on Diffusion Models for Enhanced Speaker Naturalness

Sicheng Yang, Zunnan Xu, Haiwei Xue, Yongkang Cheng, Shaoli Huang, Mingming Gong, Zhiyong Wu

TL;DR

FreeTalker presents a diffusion-based framework to generate both spontaneous co-speech gestures and non-spontaneous motions for talking avatars, trained on heterogeneous motion datasets. It unifies SMPL-X–based motion processing with a diffusion denoiser conditioned on text and audio, employing classifier-free guidance and DoubleTake to achieve controllable, long-range motion with smooth transitions. The approach achieves competitive objective metrics and favorable human judgments, demonstrating improved naturalness and flexibility over prior single-task models. This work advances realistic talking avatars and sets the stage for scalable, data-rich digital humans in interactive settings.

Abstract

Current talking avatars mostly generate co-speech gestures based on audio and text of the utterance, without considering the non-speaking motion of the speaker. Furthermore, previous works on co-speech gesture generation have designed network structures based on individual gesture datasets, which results in limited data volume, compromised generalizability, and restricted speaker movements. To tackle these issues, we introduce FreeTalker, which, to the best of our knowledge, is the first framework for the generation of both spontaneous (e.g., co-speech gesture) and non-spontaneous (e.g., moving around the podium) speaker motions. Specifically, we train a diffusion-based model for speaker motion generation that employs unified representations of both speech-driven gestures and text-driven motions, utilizing heterogeneous data sourced from various motion datasets. During inference, we use classifier-free guidance to control the style of the generated clips. Additionally, to create smooth transitions between clips, we use DoubleTake, a method that leverages a generative prior and ensures seamless motion blending. Extensive experiments show that our method generates natural and controllable speaker movements. Our code, model, and demo are available at https://youngseng.github.io/FreeTalker/.
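The classifier-free guidance mentioned in the abstract is commonly implemented by running the denoiser once with the condition and once without it, then extrapolating between the two predictions. The following is a minimal sketch of that common formulation only; the function name `cfg_combine` and the parameter `scale` are hypothetical, and the exact weighting FreeTalker uses is not stated in this summary.

```python
from typing import List


def cfg_combine(pred_uncond: List[float],
                pred_cond: List[float],
                scale: float) -> List[float]:
    """Combine two denoiser outputs with classifier-free guidance.

    Assumed (generic) formulation, not taken from the paper:
        guided = uncond + scale * (cond - uncond)
    scale = 0 ignores the condition, scale = 1 returns the conditional
    prediction, and scale > 1 pushes further toward the condition.
    """
    if len(pred_uncond) != len(pred_cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(pred_uncond, pred_cond)]


if __name__ == "__main__":
    # Toy per-joint values standing in for a predicted motion frame.
    uncond = [0.0, 1.0]
    cond = [1.0, 3.0]
    for s in (0.0, 1.0, 2.0):
        print(s, cfg_combine(uncond, cond, s))
```

Varying `scale` between clips is one way such a scheme exposes a control knob: small values let the unconditional prior dominate, while larger values make the output follow the text or audio condition more strictly.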

Paper Structure (14 sections, 2 equations, 3 figures, 1 table)

Figures (3)

  • Figure 1: (Top) Denoising module. The noising step $t$ and the noisy motion sequence $x_t$ at that step, together with the condition $c$ (text description and audio), are fed into the model. PE indicates the addition of a positional encoding. (Bottom) Sample module. We predict $\hat{x}_0$ with the denoising process, then add noise back with the diffusion process to obtain $x_{t-1}$. This process is repeated from $t=T$ until $t=0$.
  • Figure 2: Visualization of FreeTalker generation. We can control the speaker's non-spontaneous motion through text, while the speaker generates spontaneous co-speech gestures from speech. The light yellow color indicates the model's ability to smoothly transition between motion segments.
  • Figure 3: Visualization of style editing (non-spontaneous motion control) based on co-speech gestures. From top to bottom, generated motions gradually transition from text description-based control to spontaneous co-speech gestures based on speech, resulting in highly controllable gestures.
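
The sampling procedure described in the Figure 1 caption (predict $\hat{x}_0$, diffuse it back to $x_{t-1}$, repeat from $t=T$ down to $t=0$) can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the linear noise schedule, the function names `make_alpha_bar` and `sample`, and the toy denoiser are all hypothetical, and a motion sequence is flattened to a plain list of floats.

```python
import math
import random
from typing import Callable, List


def make_alpha_bar(num_steps: int,
                   beta_start: float = 1e-4,
                   beta_end: float = 0.02) -> List[float]:
    """Cumulative products of (1 - beta) for an assumed linear schedule.

    Entry i holds alpha_bar for noising step i + 1.
    """
    alpha_bar, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar


def sample(denoiser: Callable[[List[float], int], List[float]],
           length: int,
           num_steps: int,
           rng: random.Random) -> List[float]:
    """Iteratively predict x0_hat and re-noise it to the previous step."""
    alpha_bar = make_alpha_bar(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in range(length)]  # x_T: pure noise
    for t in range(num_steps, 0, -1):
        x0_hat = denoiser(x, t)  # denoising process: estimate clean motion
        if t == 1:
            x = x0_hat  # final step: keep the clean estimate as x_0
        else:
            a = alpha_bar[t - 2]  # alpha_bar for step t - 1
            # diffusion process: noise x0_hat to obtain x_{t-1}
            x = [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
                 for v in x0_hat]
    return x


if __name__ == "__main__":
    # Toy denoiser (hypothetical): always predicts the same clean sequence.
    toy = lambda x_t, t: [0.5] * len(x_t)
    print(sample(toy, length=4, num_steps=10, rng=random.Random(0)))
```

In a real system the denoiser would also receive the condition $c$ (text and audio); it is omitted here to keep the loop itself visible.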