Emphasizing Semantic Consistency of Salient Posture for Speech-Driven Gesture Generation

Fengqi Liu, Hexiang Wang, Jingyu Gong, Ran Yi, Qianyu Zhou, Xuequan Lu, Jiangbo Lu, Lizhuang Ma

TL;DR

A novel speech-driven gesture generation method that emphasizes the semantic consistency of salient postures. It introduces a weakly-supervised detector to identify salient postures and reweights the consistency loss to focus on the correspondence between salient postures and the high-level semantics of speech content.

Abstract

Speech-driven gesture generation aims at synthesizing a gesture sequence synchronized with the input speech signal. Previous methods leverage neural networks to directly map a compact audio representation to the gesture sequence, ignoring the semantic association of different modalities and failing to deal with salient gestures. In this paper, we propose a novel speech-driven gesture generation method by emphasizing the semantic consistency of salient posture. Specifically, we first learn a joint manifold space for the individual representation of audio and body pose to exploit the inherent semantic association between two modalities, and propose to enforce semantic consistency via a consistency loss. Furthermore, we emphasize the semantic consistency of salient postures by introducing a weakly-supervised detector to identify salient postures, and reweighting the consistency loss to focus more on learning the correspondence between salient postures and the high-level semantics of speech content. In addition, we propose to extract audio features dedicated to facial expression and body gesture separately, and design separate branches for face and body gesture synthesis. Extensive experimental results demonstrate the superiority of our method over the state-of-the-art approaches.

Paper Structure

This paper contains 24 sections, 14 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Salient postures are the large pose movements associated with the high-level semantics of speech content, e.g., cooling down, and are hard to generate. Our method can synthesize more realistic gestures than SEEG [liang2022seeg] by emphasizing the semantic consistency of salient postures.
  • Figure 2: The overall architecture of our proposed method, which consists of two branches: a body synthesis branch and a face synthesis branch. In the body synthesis branch, our model learns a joint manifold space of representations to enforce semantic consistency by employing a dual-path structure, which contains the upper reconstruction path and the lower speech-driven generation path. Furthermore, the salient posture detector is designed to identify salient gestures and reweight the consistency loss. We then generate synchronized facial expressions using the face synthesis branch. Finally, we fuse the generated results of the two branches to obtain the entire gesture sequence.
  • Figure 3: The detailed structure of the proposed salient posture detector. We take as input the real body poses $P^b$ and extract the initial feature $X$ using the ConvNet. Then, $X$ is fed into the temporal relation module to obtain the interaction feature $Y$. We utilize a classifier to map $Y$ to the 1D salient score $S^b$, which is used to reweight the consistency loss of joint manifold training.
  • Figure 4: The detailed structure of the face-body feature alignment module, which is trained in a self-supervised manner.
  • Figure 5: Visualization of the gesture sequences generated by all methods given the speech signal. Our method can synthesize more natural and realistic gestures with better synchrony than the others.
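
The reweighting idea described in the abstract and in the Figure 3 caption can be sketched as follows. This is only an illustrative sketch, not the paper's loss: the per-frame L1 distance, the weight form `1 + alpha * score`, the normalization by the weight sum, and the names `reweighted_consistency_loss` and `alpha` are all assumptions, since this page does not reproduce the paper's equations. It takes per-frame latent codes for audio and body pose from a joint space, plus a per-frame salient score such as $S^b$, and lets frames with higher scores contribute more to the consistency term.

```python
from typing import Sequence


def reweighted_consistency_loss(
    audio_codes: Sequence[Sequence[float]],
    pose_codes: Sequence[Sequence[float]],
    salient_scores: Sequence[float],
    alpha: float = 1.0,
) -> float:
    """Weighted mean of per-frame L1 distances between two latent sequences.

    Hypothetical formulation: each frame t gets weight 1 + alpha * s_t, where
    s_t is the salient score of that frame. The paper's actual loss may differ.
    """
    if not (len(audio_codes) == len(pose_codes) == len(salient_scores)):
        raise ValueError("all inputs must have the same number of frames")
    if len(audio_codes) == 0:
        raise ValueError("inputs must contain at least one frame")

    weighted_total = 0.0
    weight_sum = 0.0
    for audio, pose, score in zip(audio_codes, pose_codes, salient_scores):
        if len(audio) != len(pose) or len(audio) == 0:
            raise ValueError("latent codes must be non-empty and equal in size")
        # Mean absolute difference between the two latent codes of this frame.
        distance = sum(abs(a - p) for a, p in zip(audio, pose)) / len(audio)
        # Frames with a higher salient score receive a larger weight.
        weight = 1.0 + alpha * score
        weighted_total += weight * distance
        weight_sum += weight
    return weighted_total / weight_sum


# Two frames: the second is both mismatched and marked as salient.
audio = [[0.0, 0.0], [1.0, 1.0]]
pose = [[0.0, 0.0], [0.0, 0.0]]
scores = [0.0, 1.0]

plain = reweighted_consistency_loss(audio, pose, scores, alpha=0.0)
emphasized = reweighted_consistency_loss(audio, pose, scores, alpha=1.0)
print(plain, emphasized)
```

With `alpha=0.0` every frame counts equally, so the sketch reduces to an unweighted consistency term; raising `alpha` shifts the loss toward the frames the detector scores as salient.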