FastTalker: Jointly Generating Speech and Conversational Gestures from Text
Zixin Guo, Jian Zhang
TL;DR
FastTalker addresses the lack of real-time, joint generation of speech and co-speech gestures from text. It introduces an end-to-end framework that reuses intermediate TTS features for gesture decoding, adopts a causal architecture, and uses reinforcement-learning-based NAS to optimize both quality and inference speed. Evaluated on BEAT2, FastTalker achieves state-of-the-art gesture quality while needing only 0.17 seconds of inference per second of generated output on an NVIDIA 3090, which is fast enough for real-time talking avatars. The results indicate that shared rhythmic predictors and a searched architecture can substantially improve speech-gesture alignment and efficiency. Voice timbre control remains a limitation, which is left to future work via voice conversion integration.
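To make the speed figure concrete, the short sketch below converts it into a real-time factor (RTF), assuming the reported number means 0.17 seconds of compute per second of generated output. The function name `real_time_factor` is illustrative and does not come from the paper.

```python
def real_time_factor(compute_seconds: float, output_seconds: float) -> float:
    """Return compute time divided by duration of generated output.

    An RTF below 1.0 means generation is faster than playback.
    """
    if output_seconds <= 0:
        raise ValueError("output_seconds must be positive")
    return compute_seconds / output_seconds


# Assumption: 0.17 s of compute yields 1 s of speech and gestures.
rtf = real_time_factor(0.17, 1.0)
speedup = 1.0 / rtf        # how many times faster than playback
is_real_time = rtf < 1.0   # True when output can be streamed without stalling
```

Under this reading, generation runs a little under six times faster than playback, which leaves headroom for rendering and network latency in an avatar pipeline.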
Abstract
Generating 3D human gestures and speech from a text script is critical for creating realistic talking avatars. One solution is to chain separate pipelines for text-to-speech (TTS) and speech-to-gesture (STG), but this approach suffers from poor alignment between speech and gestures and from slow inference. In this paper, we introduce FastTalker, an efficient and effective framework that simultaneously generates high-quality speech audio and 3D human gestures at high inference speed. Our key insight is to reuse the intermediate features from speech synthesis for gesture generation, as these features contain more precise rhythmic information than features re-extracted from the generated speech. Specifically, 1) we propose an end-to-end framework that concurrently generates speech waveforms and full-body gestures, using intermediate speech features such as pitch, onset, energy, and duration directly for gesture decoding; 2) we redesign the network architecture to be causal, eliminating dependencies on future inputs so that it can be used in real applications; 3) we employ Reinforcement Learning-based Neural Architecture Search (NAS) to optimize the network architecture for both performance and inference speed. Experimental results on the BEAT2 dataset demonstrate that FastTalker achieves state-of-the-art performance in both speech synthesis and gesture generation, generating each second of speech and gestures in 0.17 seconds on an NVIDIA 3090.
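The causality requirement in point 2) can be illustrated with a minimal sketch. The code below is not the FastTalker architecture; it is a generic left-padded (causal) depthwise 1D convolution over per-frame prosody features, written in NumPy. The feature layout (pitch, onset, energy, duration as four channels), the kernel size, and the function name `causal_conv1d` are assumptions made for illustration.

```python
import numpy as np


def causal_conv1d(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Depthwise causal convolution.

    x:      (T, C) array of per-frame features.
    kernel: (K, C) array of per-channel weights.
    Output frame t depends only on input frames t-K+1 .. t.
    """
    k = kernel.shape[0]
    # Pad K-1 zero frames on the left only, so no future frame is ever read.
    padded = np.concatenate([np.zeros((k - 1, x.shape[1])), x], axis=0)
    out = np.empty(x.shape, dtype=float)
    for t in range(x.shape[0]):
        out[t] = (padded[t:t + k] * kernel).sum(axis=0)
    return out


# Hypothetical prosody features: 10 frames x 4 channels
# (pitch, onset, energy, duration), filled with random values.
rng = np.random.default_rng(0)
features = rng.normal(size=(10, 4))
weights = rng.normal(size=(3, 4))

y = causal_conv1d(features, weights)

# Perturb only frames 6 and later, then recompute.
perturbed = features.copy()
perturbed[6:] += 1.0
y_perturbed = causal_conv1d(perturbed, weights)

# Outputs for frames before the perturbation are unaffected.
past_unchanged = np.allclose(y[:6], y_perturbed[:6])
```

Because every output frame is computed from current and past frames only, such a layer can emit gesture features as soon as the corresponding speech features are available, which is the property a streaming avatar needs.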
