Large Body Language Models

Saif Punjwani, Larry Heck

TL;DR

This paper introduces Large Body Language Models (LBLMs) and presents LBLM-AVA, a novel LBLM architecture that combines a Transformer-XL large language model with a parallelized diffusion model to generate human-like gestures from multimodal inputs (text, audio, and video).

Abstract

As virtual agents become increasingly prevalent in human-computer interaction, generating realistic and contextually appropriate gestures in real time remains a significant challenge. While neural rendering techniques have made substantial progress with static scripts, their applicability to human-computer interactions remains limited. To address this, we introduce Large Body Language Models (LBLMs) and present LBLM-AVA, a novel LBLM architecture that combines a Transformer-XL large language model with a parallelized diffusion model to generate human-like gestures from multimodal inputs (text, audio, and video). LBLM-AVA incorporates several key components that enhance its gesture generation capabilities: multimodal-to-pose embeddings, enhanced sequence-to-sequence mapping with redefined attention mechanisms, a temporal smoothing module for gesture sequence coherence, and an attention-based refinement module for enhanced realism. The model is trained on our large-scale open-source dataset, Allo-AVA. LBLM-AVA achieves state-of-the-art performance in generating lifelike and contextually appropriate gestures, with a 30% reduction in Fréchet Gesture Distance (FGD) and a 25% improvement in Fréchet Inception Distance (FID) compared to existing approaches.
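Both metrics reported above are Fréchet distances between Gaussians fitted to feature embeddings of real and generated samples, so lower values mean the two distributions are closer. The sketch below illustrates the idea under a simplifying assumption: each feature dimension is treated as an independent Gaussian (a diagonal covariance), which avoids the matrix square root required by the full formula. The function name `diagonal_frechet_distance` and the toy feature vectors are hypothetical, and the feature extractor used by the paper is not shown here.

```python
import math
from statistics import fmean, pvariance


def diagonal_frechet_distance(real_feats, gen_feats):
    """Squared Frechet distance between two feature sets.

    Assumption: per-dimension (diagonal) Gaussians, so the trace term
    reduces to a sum of squared differences of standard deviations.
    Each argument is a list of equal-length feature vectors.
    """
    dims = len(real_feats[0])
    total = 0.0
    for d in range(dims):
        real_col = [row[d] for row in real_feats]
        gen_col = [row[d] for row in gen_feats]
        # Mean term: squared difference of the per-dimension means.
        mean_term = (fmean(real_col) - fmean(gen_col)) ** 2
        # Covariance term for diagonal Gaussians: (sigma_r - sigma_g)^2.
        std_r = math.sqrt(pvariance(real_col))
        std_g = math.sqrt(pvariance(gen_col))
        total += mean_term + (std_r - std_g) ** 2
    return total


# Toy features (hypothetical): generated set is the real set shifted by 1.
real = [[0.0, 1.0], [2.0, 3.0]]
generated = [[1.0, 2.0], [3.0, 4.0]]
print(diagonal_frechet_distance(real, real))       # identical sets
print(diagonal_frechet_distance(real, generated))  # shifted set
```

Identical feature sets give a distance of zero, and a pure shift contributes only through the mean term. A full implementation would use complete covariance matrices and a matrix square root rather than this diagonal shortcut.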

Paper Structure

This paper contains 31 sections, 16 equations, 8 figures, 6 tables, 1 algorithm.

Figures (8)

  • Figure 1: Architecture of the proposed LBLM-AVA model. The multimodal inputs are encoded using a Transformer-XL encoder, and a language-to-pose embedding module maps the encoded text to a latent pose space. A parallelized diffusion model generates multiple gesture sequences, which are then refined using an attention-based temporal refinement module. Adversarial training is employed to enhance the realism and diversity of the generated gestures. A final post-processing step optimizes gesture accuracy and human-likeness.
  • Figure 2: Representative examples from the Allo-AVA dataset, illustrating the diversity of speakers, contexts, and gestures captured within the corpus. The dataset includes a wide range of communicative scenarios, from formal presentations to casual interviews, and features speakers from various demographic and professional backgrounds.
  • Figure 3: Pipeline of the LBLM-AVA model for generating human-like gestures from multimodal inputs. During training, the model takes the gesture mapping, the speech, and the associated text as input. Video input is analyzed using OpenPose to extract keypoints from the gestures. These keypoints are then mapped and fed into the LBLM-AVA model along with the inferred text and audio features. At inference time, the available modalities are used to generate the output mapping, which is then abstracted into a mesh.
  • Figure 4: Example mesh used for rendering and evaluating outputs, produced with the Unreal Engine MetaHuman tooling.
  • Figure 5: Frequency of TED categories gathered in the dataset. From left to right: science and technology, education, people and blogs, nonprofits and activism, how-to and style, comedy, gaming, entertainment, film and animation, music, news and politics, travel and events, sports.
  • ...and 3 more figures
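
The abstract and Figure 1 mention temporal smoothing and refinement of generated gesture sequences without giving details. As a rough intuition only, the sketch below applies an exponential moving average to a sequence of keypoint frames. This is a generic stand-in technique, not the paper's learned module; the function name `ema_smooth`, the `alpha` parameter, and the flat-list frame layout are all assumptions made for illustration.

```python
def ema_smooth(frames, alpha=0.5):
    """Exponentially smooth a sequence of pose frames.

    Assumption: each frame is a flat list of keypoint coordinates of
    the same length. Higher alpha follows the raw frames more closely;
    lower alpha yields smoother but laggier motion.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for frame in frames:
        if not smoothed:
            # The first frame has no history, so it is copied unchanged.
            smoothed.append(list(frame))
        else:
            prev = smoothed[-1]
            # Blend the current frame with the previous smoothed frame.
            smoothed.append(
                [alpha * x + (1.0 - alpha) * p for x, p in zip(frame, prev)]
            )
    return smoothed


# A single coordinate that jumps from 0 to 1 is eased in over time.
print(ema_smooth([[0.0], [1.0], [1.0]], alpha=0.5))
```

A fixed filter like this trades responsiveness for smoothness with a single parameter, whereas a learned module can adapt the amount of smoothing to context.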