MVPortrait: Text-Guided Motion and Emotion Control for Multi-view Vivid Portrait Animation

Yukang Lin, Hokit Fung, Jianjin Xu, Zeping Ren, Adela S. M. Lau, Guosheng Yin, Xiu Li

TL;DR

MVPortrait tackles text-guided multi-view portrait animation by introducing FLAME as a common intermediate representation and a two-stage pipeline: Text2FLAME, which separately learns MotionDM and EmotionDM to map text to FLAME pose and expression sequences, and FLAME2Video, which renders multi-view videos conditioned on reference imagery and FLAME renderings. The framework leverages a Reference UNet, a FLAME encoder, and a view-attention-equipped diffusion model to enforce appearance fidelity, temporal coherence, and cross-view consistency, enabling text, audio, and video as driving signals. Extensive experiments on CelebV-Text and RenderMe-360 show MVPortrait outperforms baselines in motion and emotion control as well as multi-view identity preservation, with ablations validating the necessity of distinct MotionDM/EmotionDM training and view attention. The work advances practical, controllable portrait animation with broad signal compatibility, though it acknowledges limitations in text annotation accuracy and micro-expressions, pointing to future refinements in fine-grained expression control.

Abstract

Recent portrait animation methods have made significant strides in generating realistic lip synchronization. However, they often lack explicit control over head movements and facial expressions, and cannot produce videos from multiple viewpoints, resulting in less controllable and expressive animations. Moreover, text-guided portrait animation remains underexplored, despite its user-friendly nature. We present a novel two-stage text-guided framework, MVPortrait (Multi-view Vivid Portrait), to generate expressive multi-view portrait animations that faithfully capture the described motion and emotion. MVPortrait is the first to introduce FLAME as an intermediate representation, effectively embedding facial movements, expressions, and view transformations within its parameter space. In the first stage, we separately train FLAME motion and emotion diffusion models conditioned on text input. In the second stage, we train a multi-view video generation model conditioned on a reference portrait image and multi-view FLAME rendering sequences from the first stage. Experimental results show that MVPortrait outperforms existing methods in terms of motion and emotion control, as well as view consistency. Furthermore, by leveraging FLAME as a bridge, MVPortrait becomes the first controllable portrait animation framework that is compatible with text, speech, and video as driving signals.
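
The hand-off between the two stages can be sketched in a few lines. The abstract says pose and expression come from separately trained models, and the Figure 2 caption adds that the shape is taken from the reference FLAME. The sketch below is only an illustration of that bookkeeping, not the authors' code: the names `FlameFrame` and `assemble_flame_sequence` are hypothetical, and the parameter sizes in the usage note are placeholders rather than values from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class FlameFrame:
    """One frame of FLAME parameters (hypothetical container, for illustration)."""
    shape: List[float]       # identity shape, fixed across the whole sequence
    pose: List[float]        # per-frame pose, e.g. from a motion model
    expression: List[float]  # per-frame expression, e.g. from an emotion model


def assemble_flame_sequence(
    ref_shape: Sequence[float],
    pose_seq: Sequence[Sequence[float]],
    expr_seq: Sequence[Sequence[float]],
) -> List[FlameFrame]:
    """Pair per-frame pose and expression with one shared reference shape."""
    if len(pose_seq) != len(expr_seq):
        raise ValueError("pose and expression sequences must have the same length")
    return [
        FlameFrame(shape=list(ref_shape), pose=list(p), expression=list(e))
        for p, e in zip(pose_seq, expr_seq)
    ]
```

For example, `assemble_flame_sequence([0.0] * 3, [[0.1] * 2] * 4, [[0.2] * 5] * 4)` returns four frames that share the same shape vector. Keeping shape constant while only pose and expression vary per frame is what lets the reference identity carry through the generated sequence.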

Paper Structure

This paper contains 30 sections, 10 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: The unified pipeline of multi-view portrait animation. Users can obtain FLAME sequences via audio-driven methods such as Talkshow [yi2023generating], or via FLAME estimation methods such as DECA [deca] applied to a driving video. This paper focuses on text-driven animation.
  • Figure 2: The overview of MVPortrait. MVPortrait consists of two stages: Text2FLAME and FLAME2Video. In the Text2FLAME stage, a reference FLAME is first estimated from the reference image. The text prompt is divided into motion and emotion descriptions, which are then used by MotionDM and EmotionDM to generate the corresponding pose and expression sequences. These sequences, combined with the reference FLAME's shape, form the FLAME sequence. In the FLAME2Video stage, the reference image, aligned reference FLAME rendering, and multi-view renderings of the FLAME sequence are used as inputs to generate multi-view vivid and consistent animations.
  • Figure 3: (Top) The framework of $DM$. MotionDM and EmotionDM both employ MDM as the backbone. They denoise conditioned on motion or emotion descriptions separately. (Bottom) Sampling process. Given a condition, $DM$ denoises $f_T$ from a Gaussian distribution to obtain the clean motion or emotion $f_0$.
  • Figure 4: Qualitative comparison of text-guided portrait animation. Motion descriptions are highlighted in green, emotion descriptions in yellow, and appearance descriptions (used only by MMVID-interp) in blue. Generated frames are shown sequentially from left to right. Additional examples and video results can be found in supplementary materials.
  • Figure 5: The qualitative comparison of multi-view consistency. We present results from $0^\circ$, $-30^\circ$ and $30^\circ$ perspectives.
  • ...and 7 more figures
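
The sampling process in the Figure 3 caption (start from Gaussian noise $f_T$ and denoise to a clean $f_0$ under a text condition) can be sketched as a standard DDPM-style reverse loop. This is a minimal sketch under stated assumptions, not the paper's implementation: it assumes the denoiser predicts the clean sequence directly, uses a linear noise schedule with placeholder values, and stands in a toy function for the trained MotionDM or EmotionDM. The names `linear_schedule`, `sample`, and `toy_denoiser` are hypothetical.

```python
import math
import random


def linear_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Return betas and cumulative products of (1 - beta) for a linear schedule."""
    step = (beta_end - beta_start) / max(num_steps - 1, 1)
    betas = [beta_start + i * step for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


def sample(denoiser, cond, num_frames, dim, num_steps=50, seed=0):
    """Denoise f_T ~ N(0, I) down to f_0, conditioned on `cond`.

    `denoiser(f_t, t, cond)` is assumed to return an estimate of the clean
    sequence f_0 as a list of frames (each a list of floats).
    """
    rng = random.Random(seed)
    betas, alpha_bars = linear_schedule(num_steps)
    # f_T: pure Gaussian noise with the shape of the target sequence.
    f_t = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_frames)]
    for t in range(num_steps - 1, 0, -1):
        f0_hat = denoiser(f_t, t, cond)
        ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
        # Mean and std of the Gaussian posterior q(f_{t-1} | f_t, f_0).
        c0 = math.sqrt(ab_prev) * betas[t] / (1.0 - ab_t)
        ct = math.sqrt(1.0 - betas[t]) * (1.0 - ab_prev) / (1.0 - ab_t)
        sigma = math.sqrt(betas[t] * (1.0 - ab_prev) / (1.0 - ab_t))
        f_t = [
            [c0 * x0 + ct * xt + sigma * rng.gauss(0.0, 1.0)
             for x0, xt in zip(row0, rowt)]
            for row0, rowt in zip(f0_hat, f_t)
        ]
    # Final step: take the denoiser's clean prediction as f_0.
    return denoiser(f_t, 0, cond)


def toy_denoiser(f_t, t, cond):
    """Stand-in for a trained model: always predicts a constant sequence."""
    return [[cond for _ in row] for row in f_t]
```

Calling `sample(toy_denoiser, 0.25, num_frames=4, dim=6)` runs the loop end to end and returns a 4-by-6 sequence. In the setup the caption describes, the same loop would be run twice, once with a motion description to obtain pose and once with an emotion description to obtain expression.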