Text2Gestures: A Transformer-Based Network for Generating Emotive Body Gestures for Virtual Agents

Uttaran Bhattacharya, Nicholas Rewkowski, Abhishek Banerjee, Pooja Guhan, Aniket Bera, Dinesh Manocha

Abstract

We present Text2Gestures, a transformer-based learning method to interactively generate emotive full-body gestures for virtual agents aligned with natural language text inputs. Our method generates emotionally expressive gestures by utilizing the relevant biomechanical features for body expressions, also known as affective features. We also consider the intended task corresponding to the text and the target virtual agents' intended gender and handedness in our generation pipeline. We train and evaluate our network on the MPI Emotional Body Expressions Database and observe that our network achieves state-of-the-art performance in generating gestures for virtual agents aligned with the text for narration or conversation. Our network can generate these gestures at interactive rates on a commodity GPU. We conduct a web-based user study and observe that around 91% of participants rated our generated gestures as at least plausible on a five-point Likert scale. The emotions perceived by the participants from the gestures are also strongly positively correlated with the corresponding intended emotions, with a minimum Pearson coefficient of 0.77 in the valence dimension.

Paper Structure

This paper contains 34 sections, 11 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Directed pose graph. Our pose graph is a directed tree consisting of 23 joints, with the root joint as the root node of the tree, and the end-effector joints (head, wrists, toes) as the leaf nodes of the tree. We manipulate the appropriate joints to generate emotive gestures.
  • Figure 2: Text2Gestures Network. Our network takes in sentences of natural language text and transforms them to word embeddings using the pre-trained GloVe model [glove]. It then uses a transformer encoder to transform the word embeddings to latent representations, appends the agent attributes to these latent representations, and transforms the combined representations into encoded features. The transformer decoder takes in these encoded features and the past gesture history to predict gestures for the subsequent time steps. At each time step, we represent the gesture by the set of rotations on all the body joints relative to their respective parents in the pose graph at that time step.
  • Figure 3: Variance in emotive gestures. Emotions with high arousal (e.g., amused) generally have rapid limb movements, while emotions with low arousal (e.g., sad) generally have slow and subtle limb movements. Emotions with high dominance (e.g., proud) generally have an expanded upper body and spread arms, while emotions with low dominance (e.g., afraid) have a contracted upper body and arms close to the body. Our algorithm uses these characteristics to generate the appropriate gestures.
  • Figure 4: Gesture-based affective features. We use a total of 15 features: 7 angles, $A_1$ through $A_7$, 5 distance ratios, $\frac{D_1}{D_4}$, $\frac{D_2}{D_4}$, $\frac{D_8}{D_5}$, $\frac{D_7}{D_5}$, and $\frac{D_3}{D_6}$, and 3 area ratios, $\frac{R_1}{R_2}$, $\frac{R_3}{R_4}$, and $\frac{R_5}{R_6}$.
  • Figure 5: End-effector trajectories. The trajectories in the three coordinate directions for the head and two wrists. We show two sample sequences from the test set, as generated by all the methods. Removing the angle loss makes the trajectory highly jerky. Removing the pose loss makes our method unable to follow the desired trajectory. Removing the affective loss reduces the variations corresponding to emotional expressiveness. Yoon et al.'s method [cospeech_gestures] is unable to generate large amplitude variations in the trajectories because it works with a dimension-reduced representation of the sequences.
  • ...and 3 more figures
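
The captions for Figures 1 and 4 describe a pose graph stored as a directed tree and three kinds of affective features: joint angles, distance ratios, and area ratios. The Python sketch below illustrates how such quantities could be computed from 3D joint positions. It is not the authors' implementation. The 8-joint skeleton, the joint names, the coordinates in `POSE`, and the particular joints chosen for each feature are all hypothetical. The paper's graph has 23 joints, and the captions shown here do not define which joints make up $A_1$ through $A_7$, the $D_i$, or the $R_i$.

```python
import math

# Hypothetical 8-joint skeleton: joint -> parent (None marks the root).
# The paper's pose graph has 23 joints; this reduced tree is illustrative only.
PARENTS = {
    "root": None,
    "spine": "root",
    "neck": "spine",
    "head": "neck",
    "l_shoulder": "neck",
    "l_wrist": "l_shoulder",
    "r_shoulder": "neck",
    "r_wrist": "r_shoulder",
}

# Hypothetical 3D joint positions for a single pose (arbitrary units).
POSE = {
    "root": (0.0, 1.0, 0.0),
    "spine": (0.0, 1.25, 0.0),
    "neck": (0.0, 1.5, 0.0),
    "head": (0.0, 1.7, 0.0),
    "l_shoulder": (-0.2, 1.5, 0.0),
    "l_wrist": (-0.2, 0.9, 0.0),
    "r_shoulder": (0.2, 1.5, 0.0),
    "r_wrist": (0.2, 0.9, 0.0),
}


def leaf_joints(parents):
    """Return the joints that are nobody's parent (the end-effectors)."""
    internal = {p for p in parents.values() if p is not None}
    return sorted(j for j in parents if j not in internal)


def sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def cross(u, v):
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )


def norm(u):
    return math.sqrt(dot(u, u))


def joint_angle(a, b, c):
    """Angle in radians at point b between the segments b->a and b->c."""
    u, v = sub(a, b), sub(c, b)
    cos = dot(u, v) / (norm(u) * norm(v))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.acos(max(-1.0, min(1.0, cos)))


def distance_ratio(p, q, r, s):
    """Ratio |p - q| / |r - s| of two inter-joint distances."""
    return norm(sub(p, q)) / norm(sub(r, s))


def triangle_area(a, b, c):
    """Area of the triangle spanned by three 3D points."""
    return 0.5 * norm(cross(sub(b, a), sub(c, a)))


def area_ratio(tri_1, tri_2):
    """Ratio of the areas of two triangles, each given as three points."""
    return triangle_area(*tri_1) / triangle_area(*tri_2)


if __name__ == "__main__":
    p = POSE
    print("end-effectors:", leaf_joints(PARENTS))
    print("angle at left shoulder:",
          joint_angle(p["neck"], p["l_shoulder"], p["l_wrist"]))
    print("wrist span / shoulder span:",
          distance_ratio(p["l_wrist"], p["r_wrist"],
                         p["l_shoulder"], p["r_shoulder"]))
    print("arm triangle / torso triangle:",
          area_ratio((p["neck"], p["l_wrist"], p["r_wrist"]),
                     (p["root"], p["l_shoulder"], p["r_shoulder"])))
```

Ratios rather than raw distances and areas keep the features independent of the agent's overall body size, which is one plausible reason for expressing the features in Figure 4 this way. Storing the tree as a child-to-parent map also matches the parent-relative rotation representation described in the Figure 2 caption.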