FitDiff: Robust monocular 3D facial shape and reflectance estimation using Diffusion Models

Stathis Galanakis, Alexandros Lattas, Stylianos Moschoglou, Stefanos Zafeiriou

TL;DR

FitDiff presents a diffusion-based framework for robust monocular 3D facial reconstruction that jointly generates geometry and multi-modal reflectance from a single image. By conditioning a latent diffusion model on an identity embedding from a face recognition network and decoding via LSFM-based shape and VQGAN-based textures, it achieves relightable avatars with strong identity preservation. The method introduces a novel sampling guidance strategy and SPADE-based conditioning within a single-stage model, enabling unconditional sampling as well as identity-guided reconstructions from unconstrained inputs. Empirical results show state-of-the-art or competitive performance in facial reflectance acquisition and identity preservation, while offering robust handling of occlusions and varying illumination. This yields practical benefits for rendering-ready avatars in real-time graphics and virtual production workflows.

Abstract

The remarkable progress in 3D face reconstruction has resulted in high-detail and photorealistic facial representations. Recently, Diffusion Models have revolutionized the capabilities of generative methods by surpassing the performance of GANs. In this work, we present FitDiff, a diffusion-based 3D facial avatar generative model. Leveraging diffusion principles, our model accurately generates relightable facial avatars, utilizing an identity embedding extracted from an "in-the-wild" 2D facial image. The introduced multi-modal diffusion model is the first to concurrently output facial reflectance maps (diffuse albedo, specular albedo, and normals) and shapes, showcasing great generalization capabilities. It is trained solely on an annotated subset of a public facial dataset, paired with 3D reconstructions. We revisit the typical 3D facial fitting approach by guiding a reverse diffusion process using perceptual and face recognition losses. As the first 3D latent diffusion model (LDM) conditioned on face recognition embeddings, FitDiff reconstructs relightable human avatars that can be used as-is in common rendering engines, starting only from an unconstrained facial image, and achieves state-of-the-art performance.
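
The idea of guiding a reverse diffusion process with a face recognition loss can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a standard DDPM-style sampler with a linear beta schedule, a hypothetical linear "identity encoder" `W` standing in for a face recognition network, and a placeholder `denoiser` callable. For simplicity, the identity-loss gradient is taken with respect to the predicted clean latent rather than back-propagated through the denoiser. All names and parameter values here are illustrative assumptions.

```python
import numpy as np

def identity_loss(x, W, target):
    """Cosine distance between the toy embedding W @ x and the target embedding."""
    e = W @ x
    return 1.0 - (e @ target) / (np.linalg.norm(e) * np.linalg.norm(target))

def identity_loss_grad(x, W, target):
    """Analytic gradient of identity_loss with respect to x."""
    e = W @ x
    e_norm, t_norm = np.linalg.norm(e), np.linalg.norm(target)
    cos = (e @ target) / (e_norm * t_norm)
    # Derivative of the cosine similarity with respect to the embedding e.
    dcos_de = target / (e_norm * t_norm) - cos * e / e_norm**2
    # Chain rule through the linear encoder; the loss is 1 - cos.
    return -(W.T @ dcos_de)

def guided_sample(denoiser, W, target, dim, steps=50, scale=1.0, seed=0):
    """DDPM-style reverse process with gradient guidance on an identity loss."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, steps)   # assumed linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    z = rng.standard_normal(dim)             # start from Gaussian noise (Z_T)
    for k in range(steps - 1, -1, -1):
        eps = denoiser(z, k)                 # predicted noise at step k
        # Estimate of the clean latent implied by the current noise prediction.
        z0 = (z - np.sqrt(1.0 - alpha_bars[k]) * eps) / np.sqrt(alpha_bars[k])
        # Guidance: shift the noise estimate along the identity-loss gradient.
        grad = identity_loss_grad(z0, W, target)
        eps = eps + scale * np.sqrt(1.0 - alpha_bars[k]) * grad
        # Standard DDPM posterior mean, then add noise except at the last step.
        mean = (z - betas[k] / np.sqrt(1.0 - alpha_bars[k]) * eps) / np.sqrt(alphas[k])
        noise = rng.standard_normal(dim) if k > 0 else np.zeros(dim)
        z = mean + np.sqrt(betas[k]) * noise
    return z

# Usage with stand-in components (a real system would use trained networks).
demo_rng = np.random.default_rng(42)
W_demo = demo_rng.standard_normal((4, 8))        # hypothetical identity encoder
target_demo = demo_rng.standard_normal(4)        # hypothetical identity embedding
zero_denoiser = lambda z, k: np.zeros_like(z)    # placeholder noise predictor
z_final = guided_sample(zero_denoiser, W_demo, target_demo, dim=8)
print("identity loss of sample:", identity_loss(z_final, W_demo, target_demo))
```

The guidance term follows the usual classifier-guidance form, where the noise prediction is offset by the gradient of a loss scaled by the current noise level. A perceptual loss could be added to the same gradient in an analogous way.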

Paper Structure

This paper contains 23 sections, 5 equations, 11 figures, and 2 tables.

Figures (11)

  • Figure 1: Overview of FitDiff, a diffusion-based 3D facial generative network. Starting from Gaussian noise, our method generates facial avatars with relightable reflectance and shape, conditioned on an identity embedding. During sampling, a novel guidance algorithm ($\mathcal{G}$) is applied for further control of the resulting identity. $\mathbf{Z}_{T}, \mathbf{Z}_{k}$ and $\mathbf{Z}_{k-1}$ are visualized in the actual picture space for illustration purposes.
  • Figure 2: Differences from existing state-of-the-art methods [lattas2023fitme, Paraperas_2023_ICCV]: Prior works rely on multiple separate models, which can fail on challenging inputs (Fig. \ref{fig:Relightify}). In contrast, our method uses only a single Latent Diffusion Model for both shape and reflectance texture prediction, yielding a simpler architecture, simpler training, and greater robustness.
  • Figure 3: Qualitative results of FitDiff on "in-the-wild" facial images, showing shape, reflectance, and environment map renderings.
  • Figure 4: Samples generated by FitDiff with unconditional sampling. Our method can generate diverse facial shapes and reflectance maps.
  • Figure 5: Qualitative comparison between FitDiff and other monocular face reconstruction approaches [lattas2023fitme, Luo_2021_CVPR, lattas2021avatarme++, Gecer_2019_CVPR, Paraperas_2023_ICCV].
  • ...and 6 more figures