VRMM: A Volumetric Relightable Morphable Head Model

Haotian Yang, Mingwu Zheng, Chongyang Ma, Yu-Kun Lai, Pengfei Wan, Haibin Huang

TL;DR

VRMM introduces a volumetric morphable head model with disentangled, low-dimensional codes for identity $z_{id}$, expression $z_{e}$, and illumination $l$, trained in a self-supervised framework on dynamic multi-view data. Built on Mixture of Volumetric Primitives and a physically inspired relighting decoder, VRMM jointly learns a multi-identity head with decoders for mesh, identity, transformation, opacity, and color, employing a detach-concatenate strategy to stabilize training. A novel disentangled training regime, including an expression-consistency loss $\mathcal{L}_{exp}$ and KL regularization $\mathcal{L}_{KLD}$, enables robust, relightable, and animatable avatar reconstruction from few-shot inputs, with a prior-preserving fine-tuning stage to mitigate overfitting. Extensive experiments on a 254-subject dataset show state-of-the-art performance for novel view synthesis and single-view reconstruction, and demonstrate effective avatar personalization and relighting across scenes. VRMM thus provides a scalable, practical pathway to high-fidelity, controllable 3D facial avatars for applications in avatar creation, animation, and telepresence.

Abstract

In this paper, we introduce the Volumetric Relightable Morphable Model (VRMM), a novel volumetric and parametric facial prior for 3D face modeling. While recent volumetric prior models offer improvements over traditional methods like 3D Morphable Models (3DMMs), they face challenges in model learning and personalized reconstructions. Our VRMM overcomes these by employing a novel training framework that efficiently disentangles and encodes latent spaces of identity, expression, and lighting into low-dimensional representations. This framework, designed with self-supervised learning, significantly reduces the constraints for training data, making it more feasible in practice. The learned VRMM offers relighting capabilities and encompasses a comprehensive range of expressions. We demonstrate the versatility and effectiveness of VRMM through various applications like avatar generation, facial reconstruction, and animation. Additionally, we address the common issue of overfitting in generative volumetric models with a novel prior-preserving personalization framework based on VRMM. Such an approach enables high-quality 3D face reconstruction from even a single portrait input. Our experiments showcase the potential of VRMM to significantly enhance the field of 3D face modeling.

Paper Structure

This paper contains 33 sections, 16 equations, 18 figures, and 3 tables.

Figures (18)

  • Figure 1: We present VRMM, a novel volumetric head prior with a fully disentangled, low-dimensional parametric space for identity, expression, and illumination. Trained on dynamic expressions of hundreds of people captured in a LightStage with controllable illumination, VRMM enables high-quality animatable and relightable avatar reconstruction from few-shot observations.
  • Figure 2: The VRMM pipeline. Network architecture (left): VRMM takes as input an identity code $z_{id}$, an expression code $z_{e}$, a view direction $\mathbf{d}$, and an environmental light $l$. The output, comprising a base mesh and volumetric primitives, is generated by the respective decoders and rendered into an image in real time. Notably, the transformation decoder $\mathcal{D}_{T}$, the opacity decoder $\mathcal{D}_{\alpha}$, and the non-linear branch of the relightable appearance decoder $\mathcal{D}_{rgb}$ are interconnected through a detach-concatenation process between blocks, which we found to be a key factor in achieving stable results. Training framework (right): Our framework jointly trains the expression encoder $\mathcal{E}_{e}$, the transformation encoder $\mathcal{E}_{T}$, the per-person identity codes $z_{id}$, and the decoders in VRMM. Additionally, we incorporate a novel expression consistency loss $\mathcal{L}_{exp}$ to enhance the semantic alignment of expression codes.
  • Figure 3: Our model allows real-time global illumination. Each lighting condition is represented as a latitude-longitude environment map, shown at the top.
  • Figure 4: Qualitative comparison results on novel view synthesis. Our method produces more faithful results than the existing parametric head models MoFaNeRF (Zhuang et al., 2022) and HeadNeRF (Hong et al., 2022).
  • Figure 5: Interpolation results between three identities (left, center, right). Our model learns a smooth identity latent space that allows linear interpolation. In addition, the expression remains unchanged during the interpolation, confirming that the expression and identity spaces have been effectively disentangled.
  • ...and 13 more figures
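
The linear identity interpolation described in the Figure 5 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `lerp_codes`, the toy code dimensions, and the numeric values are all assumptions made for the example, and the decoding step that would turn each code pair into an image is omitted.

```python
def lerp_codes(z_a, z_b, num_steps):
    """Linearly interpolate between two latent codes of equal length.

    Returns a list of num_steps codes, starting at z_a and ending at z_b.
    (Hypothetical helper, introduced only for this sketch.)
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    if len(z_a) != len(z_b):
        raise ValueError("codes must have the same dimension")
    path = []
    for i in range(num_steps):
        t = i / (num_steps - 1)  # blend weight: 0 at z_a, 1 at z_b
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return path


# Toy identity codes; the real dimensionality of z_id is not stated here.
z_id_left = [0.0, 2.0]
z_id_right = [1.0, 4.0]
# The expression code is held fixed along the whole path, mirroring the
# caption's observation that expression stays unchanged.
z_e_fixed = [0.3, -0.1]

path = lerp_codes(z_id_left, z_id_right, num_steps=3)
decoder_inputs = [(z_id, z_e_fixed) for z_id in path]
```

Each entry of `decoder_inputs` pairs an interpolated identity code with the same expression code, so any change in the rendered result along the path would come from identity alone if the two spaces are disentangled as the caption reports.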