VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis

Enric Corona, Andrei Zanfir, Eduard Gabriel Bazavan, Nikos Kolotouros, Thiemo Alldieck, Cristian Sminchisescu

TL;DR

This paper introduces VLOGGER, a diffusion-based framework that synthesizes photorealistic, temporally coherent videos of talking and moving humans from a single image, driven by audio or text and including upper-body and hand gestures. It couples a stochastic audio-to-motion generator with a temporal, control-based diffusion model that renders frames from 2D/3D body cues and warped guidance, plus a temporal outpainting mechanism to produce variable-length videos and a super-resolution cascade for high-quality outputs. The authors curate MENTOR, a large-scale, diverse dataset with 3D pose and expression annotations and 800k identities, which they use for training and ablations. Across the HDTF, TalkingHead-1KH, and MENTOR benchmarks, VLOGGER achieves state-of-the-art image quality, identity preservation, and temporal consistency, generates diverse outputs, and supports video editing and personalization. Potential applications include content creation, education, and personalized interfaces.

Abstract

We propose VLOGGER, a method for audio-driven human video generation from a single input image of a person, which builds on the success of recent generative diffusion models. Our method consists of 1) a stochastic human-to-3d-motion diffusion model, and 2) a novel diffusion-based architecture that augments text-to-image models with both spatial and temporal controls. This supports the generation of high quality video of variable length, easily controllable through high-level representations of human faces and bodies. In contrast to previous work, our method does not require training for each person, does not rely on face detection and cropping, generates the complete image (not just the face or the lips), and considers a broad spectrum of scenarios (e.g. visible torso or diverse subject identities) that are critical to correctly synthesize humans who communicate. We also curate MENTOR, a new and diverse dataset with 3d pose and expression annotations, one order of magnitude larger than previous ones (800,000 identities) and with dynamic gestures, on which we train and ablate our main technical contributions. VLOGGER outperforms state-of-the-art methods in three public benchmarks, considering image quality, identity preservation and temporal consistency while also generating upper-body gestures. We analyze the performance of VLOGGER with respect to multiple diversity metrics, showing that our architectural choices and the use of MENTOR benefit training a fair and unbiased model at scale. Finally we show applications in video editing and personalization.

Paper Structure (31 sections, 2 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: VLOGGER is a novel framework to synthesize humans from audio. Given a single input image like the ones shown in the first column, and a sample audio input, our method generates photorealistic and temporally coherent videos of the person talking and vividly moving. As seen in the synthesized images in the right columns, we generate head motion, gaze, blinking, lip movement and, unlike previous methods, upper-body and hand gestures, thus taking audio-driven synthesis one step further.
  • Figure 2: High-level overview. VLOGGER conditions the video generation process using a statistical 3D body model. Given an input image $\mathbf{I}_{\mathbf{ref}}$ (left), the predicted shape parameters encode the geometric properties of the target identity. First, a network $M$ takes the Mel-Spectrogram $\mathbf{a}$ of an input speech and generates a sequence of 3D facial expressions $\left\{ \mathbf{\theta}^{e}_{i} \right\}_{1 \leq i \leq N}$ and body poses $\left\{ \mathbf{\theta}^{b}_{i} \right\}_{1 \leq i \leq N}$ for $N$ frames. We render dense representations of the moving 3D body to act as 2D controls $\left\{ \mathbf{C}_{i} \right\}_{1 \leq i \leq N}$ in the video generation stage (examples of controls in Sup. Mat.). Together with the reference image of the subject, these are given as input to a temporal diffusion model and a super-resolution module, which are trained to generate a sequence of photorealistic reenactments $\left\{ \mathbf{G}_{i} \right\}_{1 \leq i \leq N}$ of the target identity. Implementation details in Sup. Mat.
  • Figure 3: Our model and closest competitors across different perceived attributes, such as skin tone, gender and age, on the test set of the MENTOR dataset. Our model leverages priors from large pre-trained diffusion models and our proposed large-scale dataset. Thus, in contrast to other methods, it manages to perform consistently across all categories, showing little to no bias. We also show in \ref{fig:diversity_attributes} that our model is capable of animating humans in images at a wide range of viewpoints, instead of cropping tight bounding boxes around the face.
  • Figure 4: Qualitative comparison showing input images (left) and generated frames. Baselines typically maintain the expression along the whole sequence, and require cropping the head [sadtalker, styletalk, wang2022one]. In contrast, VLOGGER generates changes in the visible areas when considering faces (third row) but also visible upper-body (fifth row). This figure shows animated faces, but examples with gestures are shown in Figure 1 and Sup. Mat.
  • Figure 5: Showcasing model diversity. VLOGGER is stochastic and can generate a variety of videos for the same subject. Given the subject images and an input speech, columns 2-5 show the deviation in pixel color after 1-4 seconds respectively, obtained from 24 generated videos. After only one second (second col.) the model already shows great diversity in hand pose and facial expressions, with all videos of good visual quality.
  • ...and 2 more figures
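
The Figure 5 caption describes a per-pixel "deviation in pixel color" computed across 24 generated videos at a fixed timestamp. The following is a minimal sketch of one way such a diversity map could be computed. It assumes the statistic is the population standard deviation of each pixel across samples, and it uses tiny grayscale frames as nested lists. The caption does not state the exact statistic or color handling, so the function name `pixel_deviation` and the toy data are illustrative assumptions, not the authors' implementation.

```python
from statistics import pstdev
from typing import List

# Assumption: a frame is a grayscale image stored as rows of pixel intensities.
Frame = List[List[float]]


def pixel_deviation(frames: List[Frame]) -> Frame:
    """Return the per-pixel population standard deviation across frames.

    Each input frame is taken from a different generated video at the same
    timestamp; all frames must share one shape.
    """
    if not frames:
        raise ValueError("need at least one frame")
    height, width = len(frames[0]), len(frames[0][0])
    for frame in frames:
        if len(frame) != height or any(len(row) != width for row in frame):
            raise ValueError("all frames must share the same shape")
    # For every pixel location, collect its value in each frame and
    # measure how much those values spread.
    return [
        [pstdev([frame[y][x] for frame in frames]) for x in range(width)]
        for y in range(height)
    ]


# Toy data (hypothetical): the same 2x2 frame position from three samples.
# Only the top-right pixel differs between samples.
frame_a = [[0.0, 10.0], [5.0, 5.0]]
frame_b = [[0.0, 20.0], [5.0, 5.0]]
frame_c = [[0.0, 30.0], [5.0, 5.0]]

deviation = pixel_deviation([frame_a, frame_b, frame_c])
for row in deviation:
    print(row)
```

Pixels that are identical in every sample get a deviation of zero, while regions that move differently between samples (hands and faces in the caption's description) get larger values. For real video frames an array library would be the more practical choice; the pure-Python version here only keeps the example dependency-free.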