LLM Gesticulator: Leveraging Large Language Models for Scalable and Controllable Co-Speech Gesture Synthesis

Haozhou Pang, Tianwei Ding, Lanshan He, Ming Tao, Lu Zhang, Qi Gan

Abstract

In this work, we present LLM Gesticulator, an LLM-based, audio-driven co-speech gesture generation framework that synthesizes full-body animations that are rhythmically aligned with the input audio while exhibiting natural movements and editability. Compared to previous work, our model demonstrates substantial scalability: as the size of the backbone LLM increases, our framework shows proportional improvements in evaluation metrics (i.e., it follows a scaling law). Our method also exhibits strong controllability, in that the content and style of the generated gestures can be controlled by a text prompt. To the best of our knowledge, LLM Gesticulator is the first work that uses an LLM for the co-speech gesture generation task. Evaluation with existing objective metrics and user studies indicates that our framework outperforms prior work.

Paper Structure

This paper contains 18 sections, 10 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: LLM Gesticulator synthesizes full-body co-speech gestures from the input audio and a text prompt.
  • Figure 2: System Overview: Our system consists of two parts. First, we train a Residual VQ-VAE on motion data to convert it into motion tokens. Then, we use the trained MotionRVQ model to tokenize the motion, and leverage a pre-trained audio tokenizer to convert the audio into tokens. We fine-tune an LLM on these data. The fine-tuned LLM predicts the motion tokens given the audio tokens, and the MotionRVQ Decoder converts them back into a motion sequence. Text-conditioned generation is achieved by adding text tokens to the training pipeline.
  • Figure 3: Since the LLM has not previously learned the audio and motion tokens, it must first be pretrained on the audio and motion data.
  • Figure 4: Comparison of motion generation with and without text prompts under the same speech input. The left side shows the result without a text prompt, while the right side shows the result obtained by adding a text prompt.
  • Figure 5: We bind the motion capture clips from the dataset to characters and use Unity to render them into videos, which are then input to the VLLM to generate textual descriptions of the motions.
  • ...and 4 more figures
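
The motion tokenization step described in Figure 2 can be illustrated with a minimal sketch of residual vector quantization. This is not the authors' MotionRVQ implementation. The toy codebooks, the function names, and the `<motion_l{level}_{idx}>` token format are all assumptions made for illustration. In a trained Residual VQ-VAE the codebooks would be learned, and an encoder network would produce the vector being quantized.

```python
from typing import List, Sequence

Vector = Sequence[float]
Codebook = Sequence[Vector]

# Hypothetical toy codebooks (two quantization levels, 2-D latents).
# A real model would learn these during Residual VQ-VAE training.
TOY_CODEBOOKS: List[Codebook] = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],   # level 0: coarse
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # level 1: fine residual
]


def nearest(codebook: Codebook, x: Vector) -> int:
    """Return the index of the codebook entry closest to x (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, entry in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(x, entry))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(x: Vector, codebooks: Sequence[Codebook]) -> List[int]:
    """Quantize x level by level; each level encodes the previous residual."""
    residual = list(x)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Reconstruct the latent by summing the selected entry from each level."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def to_token_strings(codes: Sequence[int]) -> List[str]:
    """Map codes to hypothetical special-token strings for an LLM vocabulary."""
    return [f"<motion_l{level}_{idx}>" for level, idx in enumerate(codes)]


codes = rvq_encode([1.2, 1.0], TOY_CODEBOOKS)
tokens = to_token_strings(codes)
reconstruction = rvq_decode(codes, TOY_CODEBOOKS)
```

Each level quantizes what the previous levels left unexplained, so adding levels refines the reconstruction without enlarging any single codebook. The resulting discrete codes are what would be added to the LLM vocabulary as new tokens, which is why Figure 3 notes that the model has not seen them before.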