LLAniMAtion: LLAMA Driven Gesture Animation

Jonathan Windle, Iain Matthews, Sarah Taylor

TL;DR

This work introduces LLAniMAtion, a gesture-generation approach driven primarily by Llama2 text embeddings, and shows that these embeddings can outperform audio-based features for producing co-speech gestures, including beat and semantic gestures, even without audio input. The authors implement a cross-attentive Transformer-XL architecture and evaluate four feature configurations (text-only, audio-only, and two multimodal variants) on the GENEA Challenge dataset, using objective metrics (FGD, FD_k, BA) and a human user study. The key findings are that Llama2-based features yield more realistic and contextually appropriate gestures than PASE+ audio features, and that combining the modalities does not significantly improve performance. The study also compares LLAniMAtion with ground truth and the CSMP diffusion baseline, finding it competitive and sometimes superior in perceptual metrics. The broader implication is that semantic encodings from LLMs can make gesture animation more practical and realistic.

Abstract

Co-speech gesturing is an important modality in conversation, providing context and social cues. In character animation, appropriate and synchronised gestures add realism, and can make interactive agents more engaging. Historically, methods for automatically generating gestures were predominantly audio-driven, exploiting the prosodic and speech-related content that is encoded in the audio signal. In this paper we instead experiment with using LLM features for gesture generation that are extracted from text using LLAMA2. We compare against audio features, and explore combining the two modalities in both objective tests and a user study. Surprisingly, our results show that LLAMA2 features on their own perform significantly better than audio features and that including both modalities yields no significant difference to using LLAMA2 features in isolation. We demonstrate that the LLAMA2 based model can generate both beat and semantic gestures without any audio input, suggesting LLMs can provide rich encodings that are well suited for gesture generation.

Paper Structure (31 sections, 4 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Extracting text features using Llama2. The text is BPE-tokenised, and a Llama2 embedding is computed for each token. These embeddings are aligned with audio at 30fps by repeating frames as necessary.
  • Figure 2: Overview of the LLAniMAtion method. Our model takes Llama2 features as input, along with a speaker embedding and optional PASE+ features that encode the speech of a main-agent and an interlocutor. The features are combined and processed through a cross-attentive Transformer-XL model that produces gesture animation for the main-agent.
  • Figure 3: Generated gestures for given audio beats using the LLAniMAtion method. For a 1.5 s audio clip from the test dataset, we show the audio spectrogram, the aligned audio beat onsets with their corresponding onset strengths, and the motion gesture onsets detected for the left wrist using the beat detection method defined in Liu et al. (2022). The speaker moves their left hand from right to left and back again as the syllables are stressed.
  • Figure 4: Example laughter sequence generated using the LLAniMAtion method.
  • Figure 5: Example nod motion temporally aligned with the word "yes" being spoken, from a test sequence generated using the LLAniMAtion method.
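
The frame alignment described in the Figure 1 caption (one embedding per token, repeated as needed to match a 30 fps timeline) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `align_tokens_to_frames`, the `(start_sec, end_sec)` input format for token timings, and the choice to leave frames without any token as zeros are all assumptions made for the example.

```python
import numpy as np

FPS = 30  # frame rate stated in the Figure 1 caption


def align_tokens_to_frames(embeddings, spans, num_frames, fps=FPS):
    """Repeat each token embedding over the frames its time span covers.

    embeddings: (num_tokens, dim) array, one row per token.
    spans: list of (start_sec, end_sec), one per token (hypothetical
           input format; the source of the timings is assumed).
    Frames covered by no token stay zero (an assumption of this sketch).
    """
    embeddings = np.asarray(embeddings, dtype=np.float32)
    frames = np.zeros((num_frames, embeddings.shape[1]), dtype=np.float32)
    for emb, (start, end) in zip(embeddings, spans):
        first = int(np.floor(start * fps))
        # Every token occupies at least one frame, however short it is.
        last = max(first + 1, int(np.ceil(end * fps)))
        frames[first:min(last, num_frames)] = emb
    return frames


# Two toy 2-D "embeddings" spanning 0.0-0.5 s and 0.5-1.0 s of a 1.5 s clip.
toy = align_tokens_to_frames(
    embeddings=[[1.0, 0.0], [0.0, 1.0]],
    spans=[(0.0, 0.5), (0.5, 1.0)],
    num_frames=45,
)
```

In this toy call the first token fills frames 0 to 14, the second fills frames 15 to 29, and the final half second stays at zero. A real pipeline would take the embeddings from a Llama2 forward pass and the timings from a time-aligned transcript, and might handle silence differently.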