ConsistentAvatar: Learning to Diffuse Fully Consistent Talking Head Avatar with Temporal Guidance

Haijie Yang, Zhenyu Zhang, Hao Tang, Jianjun Qian, Jian Yang

TL;DR

The aligned Temporally-Sensitive Detail (TSD) map, which represents temporal patterns, constrains the diffusion process to generate temporally stable talking heads. ConsistentAvatar outperforms state-of-the-art methods in generated appearance, 3D, expression, and temporal consistency.

Abstract

Diffusion models have shown impressive potential for talking head generation. While they achieve plausible appearance and talking effects, these methods still suffer from temporal, 3D, or expression inconsistency due to error accumulation and the inherent limitations of single-image generation. In this paper, we propose ConsistentAvatar, a novel framework for fully consistent and high-fidelity talking avatar generation. Instead of directly applying multi-modal conditions to the diffusion process, our method first learns to model a temporal representation for stability between adjacent frames. Specifically, we propose a Temporally-Sensitive Detail (TSD) map containing high-frequency features and contours that vary significantly along the time axis. Using a temporally consistent diffusion module, we learn to align the TSD of the initial result to that of the ground-truth video frame. The final avatar is generated by a fully consistent diffusion module, conditioned on the aligned TSD, a rough head normal, and an emotion prompt embedding. We find that the aligned TSD, which represents the temporal patterns, constrains the diffusion process to generate temporally stable talking heads. Further, its reliable guidance compensates for the inaccuracy of the other conditions, suppressing accumulated error while improving consistency in several aspects. Extensive experiments demonstrate that ConsistentAvatar outperforms state-of-the-art methods in generated appearance, 3D, expression, and temporal consistency. Project page: https://njust-yang.github.io/ConsistentAvatar.github.io/
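
The abstract describes the TSD map as carrying high-frequency features and contours, and the Figure 1 caption says it is derived through Fourier transformation. This page does not give the exact construction, so the following is only a minimal sketch of the general idea, assuming an ideal high-pass filter in the Fourier domain. The function name `high_frequency_detail` and the `cutoff` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np


def high_frequency_detail(gray: np.ndarray, cutoff: float = 0.1) -> np.ndarray:
    """Return the high-frequency component of a 2-D grayscale image.

    Assumption: an ideal (hard-edged) high-pass filter stands in for the
    paper's TSD extraction, which is not specified on this page.
    `cutoff` is a hypothetical parameter: the radius of the suppressed
    low-frequency disc, in cycles per pixel (0.5 is the Nyquist limit).
    """
    h, w = gray.shape
    # Frequency coordinate (cycles per pixel) of every FFT bin.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    # Keep only bins outside the low-frequency disc.
    mask = radius > cutoff
    spectrum = np.fft.fft2(gray)
    # The mask is symmetric in frequency, so the inverse transform is real
    # up to floating-point rounding; drop the residual imaginary part.
    return np.fft.ifft2(spectrum * mask).real


# Usage: a flat background plus a fine checkerboard texture.
yy, xx = np.indices((8, 8))
checker = np.where((yy + xx) % 2 == 0, 1.0, -1.0)
frame = 5.0 + checker
detail = high_frequency_detail(frame)  # background removed, texture kept
```

In this toy example the constant background sits entirely in the zero-frequency bin and is removed, while the checkerboard sits at the Nyquist frequency and passes through unchanged. A real frame would need a softer filter and a tuned cutoff, and both choices are left open here.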

Paper Structure

This paper contains 17 sections, 9 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Overview. ConsistentAvatar begins with the highly efficient INSTA method, using its outputs as initial results (Stage 1). To address temporal consistency, we introduce Temporally-Sensitive Detail (TSD), derived through the Fourier transform. Extracting TSD from the coarse RGB output of INSTA and from the target video frame, we develop a temporal consistency diffusion model to align the input TSD with the precise one (Stage 2). Subsequently, we use the coarse normal output of INSTA as a cue for 3D perception and introduce an emotion selection module to generate an emotion embedding for each frame. With the aligned TSD, normal, and emotion embeddings as conditions, we propose a fully consistent diffusion model to generate the final avatars (Stage 3).
  • Figure 2: Emotion text selection module diagram.
  • Figure 3: Qualitative Results. Clearly, the facial avatars reconstructed by our method exhibit accurate and lifelike details, including intricate features such as wrinkles and eyes. Other methods produce excessively smooth results.
  • Figure 4: Comparison with SadTalker in terms of 3D consistency.
  • Figure 5: Comparison with different state-of-the-art methods in terms of expression consistency.
  • ...and 3 more figures