SemTalk: Holistic Co-speech Motion Generation with Frame-level Semantic Emphasis

Xiangyue Zhang, Jianfang Li, Jiaxu Zhang, Ziqiang Dang, Jianqiang Ren, Liefeng Bo, Zhigang Tu

TL;DR

SemTalk addresses holistic co-speech motion generation by explicitly decomposing motion into rhythm-related base motion and frame-level semantic-emphasized sparse motion, then adaptively fusing them via a learned semantic score. It introduces RVQ-VAE-based body-part encoding, a hierarchical coarse2fine cross-attention for base motion, and semantic emphasis through a sem-gate guided by frame-level cues and multi-modal features, with rhythmic and semantic losses reinforcing alignment. The approach achieves state-of-the-art results on BEAT2 and SHOW, delivering motions that are rhythmically coherent and semantically richer, validated by quantitative metrics and user studies. This work advances naturalistic, contextually meaningful co-speech gestures with practical impact for avatars, virtual agents, and human–computer interaction systems.
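To make the "RVQ" part of the RVQ-VAE body-part encoding more concrete, the following is a minimal sketch of generic residual vector quantization, not the authors' implementation. The codebooks, their sizes, and the helper names `nearest` and `rvq_encode` are hypothetical and chosen only for illustration; in an actual RVQ-VAE the codebooks are learned and the input is an encoder latent rather than a raw vector.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Codebook = Sequence[Vector]


def nearest(codebook: Codebook, x: Vector) -> int:
    """Return the index of the codebook entry closest to x (squared L2 distance)."""
    def sq_dist(entry: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(x, entry))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(x: Vector, codebooks: Sequence[Codebook]) -> Tuple[List[int], List[float]]:
    """Quantize x stage by stage; each stage quantizes the residual left by the previous one."""
    residual = list(x)
    reconstruction = [0.0] * len(x)
    indices: List[int] = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        entry = codebook[idx]
        indices.append(idx)
        reconstruction = [r + e for r, e in zip(reconstruction, entry)]
        residual = [r - e for r, e in zip(residual, entry)]
    return indices, reconstruction


# Hypothetical two-stage codebooks: a coarse stage followed by a fine correction stage.
coarse = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
fine = [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]]

codes, approx = rvq_encode([1.2, 0.8], [coarse, fine])
print(codes, approx)
```

Each additional stage only has to encode what the earlier stages missed, which is why a stack of small codebooks can represent motion latents more precisely than a single codebook of the same size.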

Abstract

Good co-speech motion generation cannot be achieved without carefully integrating common rhythmic motion and rare yet essential semantic motion. In this work, we propose SemTalk for holistic co-speech motion generation with frame-level semantic emphasis. Our key insight is to learn base motions and sparse motions separately, and then adaptively fuse them. In particular, a coarse2fine cross-attention module and rhythmic consistency learning are used to establish rhythm-related base motion, ensuring a coherent foundation that synchronizes gestures with the speech rhythm. Subsequently, semantic emphasis learning is designed to generate semantic-aware sparse motion, focusing on frame-level semantic cues. Finally, to integrate sparse motion into the base motion and generate semantic-emphasized co-speech gestures, we leverage a learned semantic score for adaptive synthesis. Qualitative and quantitative comparisons on two public datasets demonstrate that our method outperforms the state of the art, delivering high-quality co-speech motion with enhanced semantic richness over a stable base motion.

Paper Structure

This paper contains 13 sections, 5 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: On the left, we analyze semantic labels from the BEAT2 dataset [liu2024emage] and visualize frame-level motion, revealing that semantically relevant motions are rare and sparse, in line with real-life observations. On the right, this observation drives the design of SemTalk, which establishes a rhythm-aligned base motion and dynamically emphasizes sparse semantic gestures at the frame level. In this example, SemTalk amplifies expressiveness on words like “watching” and “just,” enhancing gesture and torso movements. The semantic scores below are automatically generated by SemTalk to modulate semantic emphasis over time.
  • Figure 2: An overview of the SemTalk pipeline. SemTalk generates holistic co-speech motion by first constructing rhythm-aligned codes $q^{r}$ in $f_r$, guided by the rhythmic consistency loss $L_{\text{Rhy}}$. Meanwhile, $f_s$ produces frame-level semantic codes $q^s$, activated selectively by the semantic score $\psi$. Finally, $q^m$ is obtained by fusing $q^{r}$ and $q^s$ based on $\psi$ and is passed to the motion decoder, yielding synchronized and contextually enriched motions.
  • Figure 3: Architecture of SemTalk. SemTalk generates holistic co-speech motion in three stages. (a) Base Motion Generation uses rhythmic consistency learning to produce rhythm-aligned codes $q^b$, conditioned on rhythmic features $\gamma_b$, $\gamma_h$. (b) Sparse Motion Generation employs semantic emphasis learning to generate semantic codes $q^s$, activated by the semantic score $\psi$. (c) Adaptive Fusion automatically combines $q^b$ and $q^s$ based on $\psi$ to produce mixed codes $q^m$ at the frame level for rhythmically aligned and contextually rich motions.
  • Figure 4: Concept comparison with LivelySpeaker [zhi2023livelyspeaker]. (Top) LivelySpeaker generates semantic gestures with CLIP embeddings in SAG and refines rhythm-related gestures separately using diffusion, which can cause jitter. (Bottom) SemTalk integrates text and speech, uses a semantic gate for fine-grained control, and unifies rhythm and semantics for smoother, more coherent motions.
  • Figure 5: Comparison on the BEAT2 [liu2024emage] dataset. SemTalk* refers to the model trained solely on the Base Motion Generation stage, which captures rhythmic alignment but lacks semantic gestures. In contrast, SemTalk emphasizes sparse yet vivid motions. For instance, when saying “my opinion,” SemTalk generates a hand-raising gesture followed by an index finger extension for emphasis. Similarly, for “never tell,” our model produces a clear, repeated gesture matching the rhythm, reinforcing the intended emphasis.
  • ...and 5 more figures
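
The frame-level fusion described in the Figure 3 caption can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' code: it assumes that fusion is a hard per-frame selection, taking the sparse code $q^s$ wherever the semantic score $\psi$ exceeds a threshold and the base code $q^b$ otherwise. The function name `fuse_codes`, the threshold value 0.5, and the toy code indices are all hypothetical.

```python
from typing import List, Sequence


def fuse_codes(
    base_codes: Sequence[int],
    sparse_codes: Sequence[int],
    sem_scores: Sequence[float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[int]:
    """Per frame, keep the base code unless the semantic score exceeds the threshold."""
    if not (len(base_codes) == len(sparse_codes) == len(sem_scores)):
        raise ValueError("all inputs must contain exactly one entry per frame")
    return [
        sparse if score > threshold else base
        for base, sparse, score in zip(base_codes, sparse_codes, sem_scores)
    ]


# Toy example with six frames; only frames 2 and 3 carry a high semantic score.
q_b = [3, 3, 7, 7, 1, 1]               # rhythm-aligned base codes
q_s = [9, 9, 9, 4, 4, 4]               # semantic sparse codes
psi = [0.1, 0.2, 0.9, 0.8, 0.3, 0.0]   # frame-level semantic scores

q_m = fuse_codes(q_b, q_s, psi)
print(q_m)  # [3, 3, 9, 4, 1, 1]
```

Because most frames have low semantic scores, the mixed sequence stays close to the base motion and departs from it only on the few frames flagged as semantically important, which matches the sparsity observation in Figure 1.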