Speech2AffectiveGestures: Synthesizing Co-Speech Gestures with Generative Adversarial Affective Expression Learning

Uttaran Bhattacharya, Elizabeth Childs, Nicholas Rewkowski, Dinesh Manocha

TL;DR

The authors design an affective encoder that uses multi-scale spatial-temporal graph convolutions to transform 3D pose sequences into latent, pose-based affective features, and they use it in both the generator and the discriminator to guide gesture synthesis.

Abstract

We present a generative adversarial network to synthesize 3D pose sequences of co-speech upper-body gestures with appropriate affective expressions. Our network consists of two components: a generator that synthesizes gestures from a joint embedding space of features encoded from the input speech and the seed poses, and a discriminator that distinguishes between the synthesized pose sequences and real 3D pose sequences. In our generator, we feed the Mel-frequency cepstral coefficients and the text transcript computed from the input speech into separate encoders to learn the desired sentiments and the associated affective cues. We design an affective encoder using multi-scale spatial-temporal graph convolutions to transform 3D pose sequences into latent, pose-based affective features. We use our affective encoder in both our generator, where it learns affective features from the seed poses to guide the gesture synthesis, and our discriminator, where it constrains the synthesized gestures to contain the appropriate affective expressions. We perform extensive evaluations on two benchmark datasets for gesture synthesis from speech, the TED Gesture Dataset and the GENEA Challenge 2020 Dataset. Compared to the best baselines, we improve the mean absolute joint error by 10-33%, the mean acceleration difference by 8-58%, and the Fréchet Gesture Distance by 21-34%. We also conduct a user study and observe that, compared to the best current baselines, around 15.28% of participants indicated our synthesized gestures appear more plausible, and around 16.32% of participants felt the gestures had more appropriate affective expressions aligned with the speech.
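
The abstract reports improvements in mean absolute joint error and mean acceleration difference without defining them here. The following is a minimal sketch of how such metrics are commonly computed, not the authors' evaluation code. It assumes a pose sequence is a nested list of T frames, each with J joints given as [x, y, z], that the joint error is the mean absolute coordinate difference, and that acceleration is a second-order finite difference over frames. All function names, including the `relative_improvement` helper, are hypothetical.

```python
from typing import List

Pose = List[List[float]]    # one frame: J joints, each [x, y, z]
PoseSequence = List[Pose]   # T frames


def mean_absolute_joint_error(pred: PoseSequence, target: PoseSequence) -> float:
    """Mean absolute difference over all frames, joints, and coordinates.

    Assumed definition; the paper may normalize or aggregate differently.
    """
    if len(pred) != len(target):
        raise ValueError("sequences must have the same number of frames")
    total, count = 0.0, 0
    for frame_p, frame_t in zip(pred, target):
        for joint_p, joint_t in zip(frame_p, frame_t):
            for p, t in zip(joint_p, joint_t):
                total += abs(p - t)
                count += 1
    if count == 0:
        raise ValueError("sequences must not be empty")
    return total / count


def accelerations(seq: PoseSequence) -> PoseSequence:
    """Second-order finite difference x[t+1] - 2*x[t] + x[t-1] per coordinate."""
    if len(seq) < 3:
        raise ValueError("need at least 3 frames to estimate acceleration")
    out: PoseSequence = []
    for t in range(1, len(seq) - 1):
        frame = []
        for j in range(len(seq[t])):
            frame.append([
                seq[t + 1][j][c] - 2.0 * seq[t][j][c] + seq[t - 1][j][c]
                for c in range(len(seq[t][j]))
            ])
        out.append(frame)
    return out


def mean_acceleration_difference(pred: PoseSequence, target: PoseSequence) -> float:
    """Mean absolute difference between the two sequences' joint accelerations."""
    return mean_absolute_joint_error(accelerations(pred), accelerations(target))


def relative_improvement(baseline: float, ours: float) -> float:
    """Percentage reduction of an error metric relative to a baseline."""
    return 100.0 * (baseline - ours) / baseline
```

Under these assumptions, a reported improvement of 10% on an error metric would correspond to `relative_improvement(baseline, ours)` returning 10.0, where lower metric values are better.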

Paper Structure

This paper contains 22 sections, 12 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: We synthesize 3D pose sequences of co-speech upper-body gestures with appropriate affective expressions. We extract the affective cues from the speech, the sentiments from the corresponding text transcripts, the individual speaker styles, and the joint-based affective expressions from the seed poses (shown on the left). We train a generative adversarial network to synthesize gestures aligned with the speech by leveraging the affective information in both the generation and the discrimination phases. We show two such affective gestures on the right, with the affects "furious" and "appalled" denoted in italics.
  • Figure 2: Our network consists of a generator (pale-green box) and a discriminator (pale-blue box). Our generator takes in the MFCC from the speech, the text transcript, the speaker ID, and a sequence of 3D seed poses. We use four encoders: the MFCC encoder, the text encoder, the speaker encoder, and the affective encoder. We feed the concatenation of these latent features into our Bi-GRU followed by a set of FC layers to synthesize the gestures aligned with the speech. Our discriminator learns to discriminate between the real and the synthesized gestures based on the latent affective features from the affective encoder, constraining the generator to synthesize appropriate affective expressions.
  • Figure 3: Qualitative results on the gestures synthesized by our method for two sample speech excerpts from the TED Gesture Dataset. The italicized words "very excited" and "bored" indicate the primary affect in the corresponding speeches. We compare with the corresponding gestures of the original speakers, the output of GTC, and that of the two ablated versions of our network (see the ablation section of the paper). See the qualitative results section of the paper for a detailed discussion of the results.
  • Figure 4: Mean fraction of participant responses on each point of the Likert scales across the 12 speech excerpts from the TED Gesture Dataset and the corresponding gestures in our user study. See the user study section of the paper for details.