DynamicAvatars: Accurate Dynamic Facial Avatars Reconstruction and Precise Editing with Diffusion Models

Yangyang Qian, Yuan Sun, Yu Guo

TL;DR

This work presents DynamicAvatars, a dynamic model that generates photorealistic, moving 3D head avatars from video clips and parameters associated with facial positions and expressions, and develops a dynamic editing strategy that selectively utilizes specific training datasets to improve the efficiency and adaptability of the model.

Abstract

Generating and editing dynamic 3D head avatars are crucial tasks in virtual reality and film production. However, existing methods often suffer from facial distortions, inaccurate head movements, and limited fine-grained editing capabilities. To address these challenges, we present DynamicAvatars, a dynamic model that generates photorealistic, moving 3D head avatars from video clips and parameters associated with facial positions and expressions. Our approach enables precise editing through a novel prompt-based editing model, which integrates user-provided prompts with guiding parameters derived from large language models (LLMs). To achieve this, we propose a dual-tracking framework based on Gaussian Splatting and introduce a prompt preprocessing module to enhance editing stability. By incorporating a specialized GAN algorithm and connecting it to our control module, which generates precise guiding parameters from LLMs, we successfully address the limitations of existing methods. Additionally, we develop a dynamic editing strategy that selectively utilizes specific training datasets to improve the efficiency and adaptability of the model for dynamic editing tasks.

Paper Structure

This paper contains 18 sections, 17 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Demonstration results of our method. DynamicAvatars renders photorealistic images from dynamic models and supports flexible editing. With the dual-tracking method and LLM-guided prompts, users can readily change an avatar's expression, appearance, and accessories.
  • Figure 2: The pipeline of our method. Our pipeline is divided into two main stages: the modeling stage and the editing stage. In the modeling stage, video clips and FLAME model parameters are used as input. The objective is to train a model consisting of Gaussian splats, regulated by the FLAME model. The output of this stage is a dynamic Gaussian model capable of accurately representing the head avatar. In the editing stage, expression parameters and guiding prompts are provided as input. A Large Language Model (LLM) is employed to refine and enhance the structure of the prompts, while a discriminator improves the quality of the style-edited images generated by the model. The output of this stage consists of rendered images that reflect the applied edits.
  • Figure 3: Pipeline of the Modeling Stage. In this stage, we use a dual-tracking method to maintain the relative positions of Gaussian splats, facilitating the editing process in the subsequent stage. For a given set of video clips $\{I_i\}$ and the corresponding FLAME parameter set $\{F_i\}$, we employ a Facial Component Identifier (FCI) to identify components within the face images. Semantic masks are then used to label the Gaussian splats contributing to each corresponding facial area. Additionally, we bind the Gaussian splats to the mesh of the FLAME model to preserve the spatial structure of the human face. In our experiments, we utilize
  • Figure 4: Details of the First Editing Stage. For the image $I_{p_0}^{t_0}$ at the baseline time point $t_0$ and baseline camera pose $p_0$, the goal is to edit the selected masked area. To achieve this, we first warp the mask $M_{p_0}^{t_0}$ to $M_{p_j}^{t_i}$ using a mapping network. This step identifies the Gaussian splats contributing to the masked area across different time points $t_i$ and camera poses $p_j$. Subsequently, we render and edit the corresponding images, ensuring the overall quality is preserved with the assistance of a discriminator. In practice, we use DALL-E as our conditional diffusion model.
  • Figure 5: Space of our mapping net. It takes the target timestep $t$, camera pose $p$, and original mask image $M_{p_0}^{t_0}$ as input, and outputs the target mask $M_{p}^{t}$. To preserve continuity over time and camera pose, we apply bilinear interpolation on the $tOp$ plane. We train this module using the masks of the training dataset at different times and poses generated in the first stage.
  • ...and 4 more figures
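
The Figure 5 caption mentions bilinear interpolation on the $tOp$ plane to keep masks continuous over time and camera pose. The sketch below illustrates only that interpolation step and is not the authors' mapping network. It assumes, for illustration, that the camera pose is reduced to a single scalar `p`, that soft masks are already available on a regular grid of times and poses, and that the names `interpolate_mask`, `t_grid`, and `p_grid` are hypothetical.

```python
import numpy as np

def interpolate_mask(masks, t_grid, p_grid, t, p):
    """Bilinearly interpolate masks sampled on a regular (t, p) grid.

    masks:  array of shape (T, P, H, W), one soft mask per (time, pose) sample
    t_grid: increasing 1-D array of T >= 2 sample times
    p_grid: increasing 1-D array of P >= 2 scalar pose values (an assumption;
            real camera poses have more degrees of freedom)
    """
    masks = np.asarray(masks, dtype=float)
    t_grid = np.asarray(t_grid, dtype=float)
    p_grid = np.asarray(p_grid, dtype=float)

    # Clamp the query to the sampled range so we never extrapolate.
    t = float(np.clip(t, t_grid[0], t_grid[-1]))
    p = float(np.clip(p, p_grid[0], p_grid[-1]))

    # Locate the grid cell containing (t, p); i and j index its lower corner.
    i = int(np.clip(np.searchsorted(t_grid, t, side="right") - 1, 0, len(t_grid) - 2))
    j = int(np.clip(np.searchsorted(p_grid, p, side="right") - 1, 0, len(p_grid) - 2))

    # Fractional position of the query inside the cell, each in [0, 1].
    a = (t - t_grid[i]) / (t_grid[i + 1] - t_grid[i])
    b = (p - p_grid[j]) / (p_grid[j + 1] - p_grid[j])

    # Weighted sum of the four corner masks.
    return ((1 - a) * (1 - b) * masks[i, j]
            + a * (1 - b) * masks[i + 1, j]
            + (1 - a) * b * masks[i, j + 1]
            + a * b * masks[i + 1, j + 1])
```

Because the weights vary continuously with `t` and `p`, the interpolated mask changes smoothly between neighboring samples, which is the continuity property the caption describes.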