Stable Video Portraits

Mirela Ostrek; Justus Thies

TL;DR

Stable Video Portraits (SVP) is a high-fidelity monocular avatar pipeline that combines a large 2D diffusion prior with a 3D morphable model (3DMM) to produce temporally coherent talking-head videos conditioned on 3DMM sequences. The method fine-tunes Stable Diffusion via ControlNet on a short training video and introduces a temporal denoising scheme that uses the previous frame to stabilize renderings, while enabling text-driven celebrity morphing without test-time fine-tuning. SVP outperforms monocular head-avatar baselines on quantitative metrics such as LPIPS, FID, and KID, and delivers qualitative gains in fine facial details and reliable identity morphing under challenging expressions. This work advances telepresence, AR/VR, and content creation by enabling controllable, morphable, person-specific avatars from monocular video with practical data efficiency and temporal coherence.

Abstract

Rapid advances in the field of generative AI and text-to-image methods in particular have transformed the way we interact with and perceive computer-generated imagery today. In parallel, much progress has been made in 3D face reconstruction, using 3D Morphable Models (3DMM). In this paper, we present SVP, a novel hybrid 2D/3D generation method that outputs photorealistic videos of talking faces leveraging a large pre-trained text-to-image prior (2D), controlled via a 3DMM (3D). Specifically, we introduce a person-specific fine-tuning of a general 2D stable diffusion model which we lift to a video model by providing temporal 3DMM sequences as conditioning and by introducing a temporal denoising procedure. As an output, this model generates temporally smooth imagery of a person with 3DMM-based controls, i.e., a person-specific avatar. The facial appearance of this person-specific avatar can be edited and morphed to text-defined celebrities, without any fine-tuning at test time. The method is analyzed quantitatively and qualitatively, and we show that our method outperforms state-of-the-art monocular head avatar methods.

Paper Structure (32 sections, 7 equations, 12 figures, 2 tables)

Introduction
Related Work
3D Head Avatars:
2D Generative AI:
Video Diffusion:
Background
Denoising diffusion probabilistic model (DDPM):
Latent diffusion model (LDM):
Conditional LDM:
Methodology
Monocular Head Avatars
Spatio-temporal inference procedure:
Training Stage I:
Training Stage II:
Text-based Celeb Face Morphing
...and 17 more sections

Figures (12)

Figure 1: System Overview: (I) Using Spectre, face parsing maps (FPM), and MediaPipe, the input video is processed to extract per-frame 3D face reconstructions (3DMM), FPM, and the iris location. (II) Based on this data, two ControlNets are trained in parallel, allowing for the generation of temporally stable outlines (Stage I) and inner details (Stage II), resulting in photo-realistic personal avatars (SD is fine-tuned in the unlocked mode). (III) Person-specific avatars may be further morphed into a celebrity via text, without additional fine-tuning (using the locked SD).
Figure 1: Quantitative ablation study: To investigate the effect of each of the newly introduced denoising process parameters, namely $w_n$ (noise term importance), $w_c$ (importance of the current frame), and $w_p$ (importance of the previous frame), we show the results on five standard evaluation metrics (PSNR, SSIM, MSE, LPIPS, and FID) as well as our proposed smoothness metric. Darker cells contain higher values.
Figure 2: Spatio-temporal Denoising: Using the prediction for the frame ${f_{n-1}}$, we modify the inference in the DDIM step $t=\tau$ for frame ${f_{n}}$ to consider the previous frame, which leads to temporally smooth outputs, controlled by $w_c,w_p$ and $w_n$.
Figure 2: Data: We have released a portrait avatar dataset that contains 6 long video sequences ($8+$ minutes) of women speaking with head movement, for research purposes.
Figure 3: ControlNet Strength: Lower values give more importance to the celeb ID (as defined by text), but they lead to inconsistent videos and low controllability via 3DMM. Higher values allow for more control while preserving the original identity.
...and 7 more figures
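
The spatio-temporal denoising caption above names three weights, $w_c$, $w_p$, and $w_n$, that balance the current frame, the previous frame, and a noise term at a single DDIM step $t=\tau$. The sketch below illustrates one way such a blend could look. It is not the paper's equation: the weighted-sum form, the use of the previous frame's final prediction, the toy `denoise_step` callback, and all function names are assumptions made for illustration, and latents are plain Python lists rather than tensors.

```python
import random


def blend_latents(current, previous, w_c, w_p, w_n, rng):
    """Blend the current latent with the previous frame's latent and fresh noise.

    Assumed form (not taken from the paper): w_c * current + w_p * previous + w_n * noise.
    """
    if len(current) != len(previous):
        raise ValueError("latents must have the same length")
    return [
        w_c * c + w_p * p + w_n * rng.gauss(0.0, 1.0)
        for c, p in zip(current, previous)
    ]


def denoise_sequence(initial_latents, denoise_step, num_steps, tau,
                     w_c, w_p, w_n, seed=0):
    """Denoise frames one after another, blending in the previous frame at step tau.

    `denoise_step(latent, t)` is a hypothetical stand-in for one DDIM update.
    """
    rng = random.Random(seed)
    outputs = []
    previous = None  # the first frame has no predecessor, so it is left unblended
    for latent in initial_latents:
        x = list(latent)
        for t in range(num_steps - 1, -1, -1):
            x = denoise_step(x, t)
            if t == tau and previous is not None:
                x = blend_latents(x, previous, w_c, w_p, w_n, rng)
        outputs.append(x)
        previous = x  # assumption: reuse the finished prediction of frame n-1
    return outputs
```

In this toy setting, setting `w_p = 0` and `w_n = 0` with `w_c = 1` recovers independent per-frame denoising, while raising `w_p` pulls each frame toward its predecessor, which is the trade-off the ablation over $w_c$, $w_p$, and $w_n$ explores.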
