From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations

Evonne Ng, Javier Romero, Timur Bagautdinov, Shaojie Bai, Trevor Darrell, Angjoo Kanazawa, Alexander Richard

TL;DR

This work tackles audio-driven generation of photorealistic conversational avatars that animate the face, body, and hands in dyadic interactions. It introduces a two-branch architecture in which both branches are conditioned on audio: a diffusion-based face model that is additionally conditioned on lip geometry, and a diffusion-based body model guided by coarse poses generated autoregressively from a residual VQ-VAE. Together these produce high-frequency, diverse gestures synchronized to speech. A novel multi-view dyadic conversational dataset and a subject-specific photorealistic renderer support training and evaluation, with metrics capturing realism and diversity in both geometry and kinetics. Experiments show that the combined model outperforms diffusion-only and VQ-only baselines, and perceptual tests indicate that photoreal renders reveal subtle gestural nuances better than meshes, underscoring the value of photorealism for evaluating and deploying conversational avatars.

Abstract

We present a framework for generating full-bodied photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction. Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands. The key behind our method is in combining the benefits of sample diversity from vector quantization with the high-frequency details obtained through diffusion to generate more dynamic, expressive motion. We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures (e.g. sneers and smirks). To facilitate this line of research, we introduce a first-of-its-kind multi-view conversational dataset that allows for photorealistic reconstruction. Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods. Furthermore, our perceptual evaluation highlights the importance of photorealism (vs. meshes) in accurately assessing subtle motion details in conversational gestures. Code and dataset available online.

Paper Structure (24 sections, 5 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Synthesizing photoreal conversational avatars. Given the audio from a dyadic conversation, we generate realistic conversational motion for the face, body, and hands. The motion can then be rendered as a photorealistic video. Please see https://youtu.be/Y0GMaMtUynQ.
  • Figure 2: Importance of photorealism. Top: Mesh annotations from prior work (Yi et al., 2023). Bottom: Our photorealistic renderings. For the mesh, differences in laughing (top left) vs. speaking (top right) are difficult to perceive. In contrast, photorealism allows us to capture subtle details such as the smirk (bottom left) vs. grimace (bottom right), which completely change the perception of her current mood despite similar coarse body poses.
  • Figure 3: Method overview. Our method takes as input conversational audio and generates corresponding face codes and body-hand poses. The output motion is then fed into our trained avatar renderer, which generates a photorealistic video. For details on the face/pose models, please see Figure 4.
  • Figure 4: Motion generation. (a) Given conversational audio $\mathbf{A}$, we generate facial motion $\mathbf{F}$ using a diffusion network conditioned on both audio and the output of a lip regression network $\mathbf{L}$, which predicts synced lip geometry from speech audio. (b) For the body-hand poses, we first autoregressively generate guide poses $\mathbf{P}$ at a low fps using a VQ-Transformer. (c) The pose diffusion model then uses these guide poses and audio to produce a high-frequency motion sequence $\mathbf{J}$.
  • Figure 5: Diversity of guide pose rollouts. Given the input audio for the conversation (predicted person's audio in gold), the transformer $\mathcal{P}$ generates diverse samples of guide pose sequences with variations in listening reactions (top), speech gestures (middle), and interjections (bottom). Sampling from a rich codebook of learned poses, $\mathcal{P}$ can produce "extreme" poses (e.g. pointing, itching, clapping) with high diversity across different samples. These diverse poses are then used to condition the body diffusion model $\mathcal{J}$.
  • ...and 3 more figures
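
The guide poses in Figures 4 and 5 are sampled from a learned codebook, which the TL;DR attributes to a residual VQ-VAE. As a rough illustration of what residual vector quantization does, the sketch below encodes one vector with a stack of codebooks, where each level quantizes the residual left by the previous one. It is not the authors' implementation: the function names, the dimensions, and the random codebooks are all assumptions made for illustration, whereas in a trained model the codebooks are learned jointly with an encoder and decoder.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Quantize x level by level; each codebook codes the previous level's residual."""
    residual = np.asarray(x, dtype=float).copy()
    recon = np.zeros_like(residual)
    indices = []
    for codebook in codebooks:
        # Pick the code closest (Euclidean distance) to the current residual.
        dists = np.linalg.norm(codebook - residual, axis=1)
        k = int(np.argmin(dists))
        indices.append(k)
        recon += codebook[k]      # accumulate the reconstruction
        residual -= codebook[k]   # pass what is left to the next level
    return indices, recon

def decode(indices, codebooks):
    """Rebuild the vector by summing the selected code from every level."""
    return sum(cb[k] for cb, k in zip(codebooks, indices))

# Hypothetical sizes and random codebooks, for illustration only.
rng = np.random.default_rng(0)
dim, codes_per_level, levels = 8, 16, 3
codebooks = [
    rng.normal(scale=1.0 / 2**i, size=(codes_per_level, dim))
    for i in range(levels)
]
pose = rng.normal(size=dim)  # stand-in for an encoded pose feature
indices, recon = residual_quantize(pose, codebooks)
```

Each pose is then represented by one discrete index per level, which is the kind of token sequence an autoregressive transformer can sample from to obtain diverse rollouts.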