FastTalker: Jointly Generating Speech and Conversational Gestures from Text

Zixin Guo, Jian Zhang

TL;DR

FastTalker targets real-time, joint generation of speech and co-speech gestures from text. It introduces an end-to-end framework that reuses intermediate TTS features for gesture decoding, adopts a causal architecture, and uses reinforcement-learning-based NAS to optimize both performance and inference speed. Evaluated on BEAT2, FastTalker achieves state-of-the-art gesture quality while taking 0.17 seconds to process each second of speech and gestures on an NVIDIA 3090, which is fast enough for real-time talking avatars. The results indicate that shared rhythm predictors and a searched architecture improve speech-gesture alignment and efficiency. Voice timbre control remains a limitation, identified for future work via voice conversion integration.
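To make the speed figure concrete, the short sketch below treats "0.17s per second" as a real-time factor, that is, compute time divided by the duration of generated output. That reading, the helper names, and the 10-second clip are assumptions for illustration; the linear extrapolation to a longer clip is not a measurement from the paper.

```python
def real_time_factor(compute_seconds: float, output_seconds: float) -> float:
    """Seconds of compute spent per second of generated output."""
    if output_seconds <= 0:
        raise ValueError("output_seconds must be positive")
    return compute_seconds / output_seconds


def speedup_over_real_time(rtf: float) -> float:
    """How many times faster than playback speed the generation runs."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Reported figure: 0.17 s of processing per 1 s of speech and gestures.
rtf = real_time_factor(0.17, 1.0)
speedup = speedup_over_real_time(rtf)

# Hypothetical 10 s clip, assuming cost scales linearly with duration.
clip_compute = rtf * 10.0

print(f"real-time factor: {rtf:.2f}")
print(f"speedup over real time: {speedup:.2f}x")
print(f"estimated compute for a 10 s clip: {clip_compute:.1f} s")
```

Any real-time factor below 1.0 means output is produced faster than it is played back, which is the condition a live talking avatar needs.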

Abstract

Generating 3D human gestures and speech from a text script is critical for creating realistic talking avatars. One solution is to leverage separate pipelines for text-to-speech (TTS) and speech-to-gesture (STG), but this approach suffers from poor alignment of speech and gestures and slow inference times. In this paper, we introduce FastTalker, an efficient and effective framework that simultaneously generates high-quality speech audio and 3D human gestures at high inference speeds. Our key insight is reusing the intermediate features from speech synthesis for gesture generation, as these features contain more precise rhythmic information than features re-extracted from generated speech. Specifically, 1) we propose an end-to-end framework that concurrently generates speech waveforms and full-body gestures, using intermediate speech features such as pitch, onset, energy, and duration directly for gesture decoding; 2) we redesign the causal network architecture to eliminate dependencies on future inputs for real applications; 3) we employ Reinforcement Learning-based Neural Architecture Search (NAS) to enhance both performance and inference speed by optimizing our network architecture. Experimental results on the BEAT2 dataset demonstrate that FastTalker achieves state-of-the-art performance in both speech synthesis and gesture generation, processing speech and gestures in 0.17 seconds per second on an NVIDIA 3090.
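The causality requirement in point 2 can be illustrated with a minimal sketch. The function below is a generic, weight-free causal self-attention written for illustration only; it is not the FastTalker architecture, and the frame values are made up. It shows the property the abstract asks for: the output at a given position does not change when a later input changes.

```python
import math


def causal_self_attention(x):
    """Single-head self-attention without learned weights.

    Position t attends only to positions 0..t, so no output depends on
    future inputs.
    """
    dim = len(x[0])
    outputs = []
    for t, query in enumerate(x):
        # Scaled dot-product scores against the current and past positions only.
        scores = [
            sum(q * k for q, k in zip(query, x[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        outputs.append([
            sum(w * x[s][d] for s, w in enumerate(weights)) / total
            for d in range(dim)
        ])
    return outputs


# Made-up 2-dimensional feature frames.
frames = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
baseline = causal_self_attention(frames)

# Change only the last frame and recompute.
changed = [row[:] for row in frames]
changed[3] = [9.0, -9.0]
perturbed = causal_self_attention(changed)

# Outputs for the first three positions are identical in both runs.
prefix_unchanged = baseline[:3] == perturbed[:3]
print(prefix_unchanged)
```

Because earlier outputs never depend on later inputs, such a model can emit speech and gesture frames as text arrives instead of waiting for the full script.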

Paper Structure (26 sections, 11 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Conceptual Comparison. Previous works process (a) text to speech and (b) speech to gesture separately, generating gestures with redundant encoders and inaccurate latent features. We propose a unified model (c) that reuses intermediate features to jointly generate speech and gestures efficiently and effectively.
  • Figure 2: FastTalker first preprocesses text scripts into phoneme and word sequences. For speech generation (top), initial features are encoded by a causal transformer encoder, and then separate rhythm predictors are adopted to estimate intermediate speech duration, pitch, and energy to decode speech. For gesture generation (bottom), it employs shared rhythm predictors to predict gesture rhythm features and fuse them with the speech features via attention. The merged features are concatenated with semantic features extracted from word embeddings for gesture latent decoding. Finally, gestures are reconstructed from the latent features by pretrained VQVAEs.
  • Figure 3: Neural Architecture Search for FastTalker. (a) The architecture of FastTalker is decomposed into four types of basic network blocks including: Transformer Encoder Block (b), CNN Block (c), Gated CNN Block (d), and Transformer Decoder Block (e) for the phoneme/semantic encoder, rhythm translator, speech decoder, and gesture decoder respectively. The hyperparameters of these blocks, such as the number of layers, are optimized with NAS.
  • Figure 4: User Study Win Rate and CMOS. Compared against ground truth and the separate FastSpeech2 [ren2020fastspeech] + EMAGE [liu2023emage] pipeline, FastTalker consistently outperforms FastSpeech2 + EMAGE, with a higher win rate on gesture preference, audio-motion alignment, and gesture diversity. The CMOS also shows that FastTalker generates audio samples of similar quality to FastSpeech2.
  • Figure 5: Subjective Generated Gesture Samples. FastTalker generates gestures that are not only well aligned with the audio rhythm but also semantically aligned with the input text. For example, it presents a "reject" hand gesture when the sentence contains the word "never".
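The search described in Figure 3 can be sketched as a small policy-gradient loop over layer counts. Everything specific here is an assumption for illustration: the candidate layer counts, the toy reward (a made-up quality proxy minus a made-up latency penalty), and the REINFORCE-style controller. The source only states that block hyperparameters such as the number of layers are optimized with reinforcement-learning-based NAS for both performance and inference speed.

```python
import math
import random

# Hypothetical search space: layer counts for the four components in Figure 3.
SEARCH_SPACE = {
    "encoder_layers": [2, 4, 6],
    "rhythm_layers": [1, 2, 3],
    "speech_decoder_layers": [2, 4, 6],
    "gesture_decoder_layers": [2, 4, 6],
}


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_reward(arch):
    """Placeholder reward; a real search would train and time each candidate."""
    depth = sum(arch.values())
    quality = 1.0 - math.exp(-0.3 * depth)  # saturating gain from depth
    latency = 0.03 * depth                  # linear cost of depth
    return quality - latency


def search(steps=300, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = {name: [0.0] * len(opts) for name, opts in SEARCH_SPACE.items()}
    baseline = 0.0
    for _ in range(steps):
        # Sample one option index per component from the current policy.
        picks = {}
        for name, opts in SEARCH_SPACE.items():
            probs = softmax(logits[name])
            picks[name] = rng.choices(range(len(opts)), weights=probs)[0]
        arch = {name: SEARCH_SPACE[name][i] for name, i in picks.items()}
        reward = toy_reward(arch)
        advantage = reward - baseline
        baseline = 0.9 * baseline + 0.1 * reward  # moving-average baseline
        # REINFORCE update: gradient of log-probability of the sampled option.
        for name, chosen in picks.items():
            probs = softmax(logits[name])
            for i, p in enumerate(probs):
                grad = (1.0 if i == chosen else 0.0) - p
                logits[name][i] += lr * advantage * grad
    # Report the highest-scoring option per component.
    best = {}
    for name, opts in SEARCH_SPACE.items():
        best_index = max(range(len(opts)), key=logits[name].__getitem__)
        best[name] = opts[best_index]
    return best, logits


best_arch, final_logits = search()
print(best_arch)
```

The reward combines a quality term and a latency term so that the controller is pushed toward architectures that balance the two, mirroring the stated goal of improving both performance and inference speed.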