SemTalk: Holistic Co-speech Motion Generation with Frame-level Semantic Emphasis
Xiangyue Zhang, Jianfang Li, Jiaxu Zhang, Ziqiang Dang, Jianqiang Ren, Liefeng Bo, Zhigang Tu
TL;DR
SemTalk addresses holistic co-speech motion generation by explicitly decomposing motion into rhythm-related base motion and frame-level, semantically emphasized sparse motion, then adaptively fusing the two via a learned semantic score. It introduces RVQ-VAE-based body-part encoding, a hierarchical coarse2fine cross-attention module for base motion, and semantic emphasis learning through a sem-gate guided by frame-level cues and multi-modal features, with rhythmic and semantic losses reinforcing alignment. The approach achieves state-of-the-art results on BEAT2 and SHOW, producing motions that are rhythmically coherent and semantically richer, as validated by quantitative metrics and user studies. This work advances naturalistic, contextually meaningful co-speech gestures, with practical impact for avatars, virtual agents, and human–computer interaction systems.
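To make the RVQ part of "RVQ-VAE-based body-part encoding" concrete, the following is a minimal sketch of generic residual vector quantization, not the authors' implementation. Each stage quantizes whatever residual the previous stage left behind, so later codebooks refine earlier ones. The function names (`nearest`, `rvq_encode`, `rvq_decode`) and the toy codebooks are hypothetical; in an RVQ-VAE the input would be an encoder latent and the codebooks would be learned.

```python
from typing import List, Sequence


def nearest(codebook: Sequence[Sequence[float]], v: Sequence[float]) -> int:
    """Return the index of the codebook entry closest to v (squared L2 distance)."""
    def sq_dist(c: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(c, v))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(x: Sequence[float],
               codebooks: Sequence[Sequence[Sequence[float]]]) -> List[int]:
    """Quantize x stage by stage; each stage encodes the remaining residual."""
    residual = list(x)
    indices = []
    for cb in codebooks:
        i = nearest(cb, residual)
        indices.append(i)
        # Subtract the chosen code so the next stage sees only what is left.
        residual = [r - c for r, c in zip(residual, cb[i])]
    return indices


def rvq_decode(indices: Sequence[int],
               codebooks: Sequence[Sequence[Sequence[float]]]) -> List[float]:
    """Reconstruct by summing the selected code from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for i, cb in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, cb[i])]
    return out


# Toy example (assumed values): a coarse codebook followed by a fine one.
coarse = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
fine = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
codes = rvq_encode([1.0, 0.1], [coarse, fine])
recon = rvq_decode(codes, [coarse, fine])
```

In this toy case the coarse stage captures the dominant component and the fine stage absorbs the small remainder, which is the general motivation for stacking quantizers.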
Abstract
Good co-speech motion generation cannot be achieved without careful integration of common rhythmic motion and rare yet essential semantic motion. In this work, we propose SemTalk for holistic co-speech motion generation with frame-level semantic emphasis. Our key insight is to learn base motions and sparse motions separately, and then adaptively fuse them. In particular, a coarse2fine cross-attention module and rhythmic consistency learning are used to establish rhythm-related base motion, ensuring a coherent foundation that synchronizes gestures with the speech rhythm. Subsequently, semantic emphasis learning is designed to generate semantic-aware sparse motion, focusing on frame-level semantic cues. Finally, to integrate the sparse motion into the base motion and generate semantic-emphasized co-speech gestures, we leverage a learned semantic score for adaptive synthesis. Qualitative and quantitative comparisons on two public datasets demonstrate that our method outperforms state-of-the-art methods, delivering high-quality co-speech motion with enhanced semantic richness over a stable base motion.
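The adaptive synthesis step can be illustrated with a small sketch. The abstract does not state the fusion rule, so the per-frame convex blend below is an assumption chosen for clarity, not the authors' formulation; the name `fuse_motion` and the toy values are hypothetical. The idea it shows is that frames with a low semantic score keep the rhythm-driven base motion, while frames with a high score are dominated by the sparse semantic motion.

```python
from typing import List, Sequence


def fuse_motion(base: Sequence[Sequence[float]],
                sparse: Sequence[Sequence[float]],
                scores: Sequence[float]) -> List[List[float]]:
    """Blend base and sparse motion frame by frame.

    base, sparse: one feature vector per frame (same shape).
    scores: one semantic score per frame, clamped to [0, 1] here.
    Assumed rule: fused_t = (1 - s_t) * base_t + s_t * sparse_t.
    """
    if not (len(base) == len(sparse) == len(scores)):
        raise ValueError("base, sparse, and scores must cover the same frames")
    fused = []
    for b, s, w in zip(base, sparse, scores):
        w = min(1.0, max(0.0, w))  # keep the blend weight in [0, 1]
        fused.append([(1.0 - w) * bi + w * si for bi, si in zip(b, s)])
    return fused


# Toy example (assumed values): frame 0 is purely rhythmic, frame 1 is half semantic.
base = [[0.0, 0.0], [1.0, 1.0]]
sparse = [[2.0, 4.0], [3.0, 5.0]]
fused = fuse_motion(base, sparse, scores=[0.0, 0.5])
```

A hard threshold on the score (replace the base frame when the score exceeds some cutoff) would be another plausible reading; the soft blend is used here only because it makes the role of the score easy to see.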
