Total-Editing: Head Avatar with Editable Appearance, Motion, and Lighting

Yizhou Zhao, Chunjiang Liu, Haoyu Chen, Bhiksha Raj, Min Xu, Tadas Baltrusaitis, Mitch Rundle, HsiangTao Wu, Kamran Ghasedi

TL;DR

Total-Editing tackles portrait reenactment and relighting jointly by introducing an intrinsically decomposed neural radiance field (NeRF) with Phong-based shading, lightmaps, and a moving least squares (MLS) deformation field. It disentangles appearance, motion, and lighting, and is trained on a large synthetic dataset plus real video data, which enables 3D-aware portrait editing with illumination controlled by either a portrait image or an HDR environment map. Key contributions are the intrinsically decomposed NeRF decoder, the MLS-based deformation field for spatiotemporal coherence, and a lightmap-based illumination pathway that enables accurate shading during head motion and lighting transfer. The authors report higher-quality and more flexible results than prior methods, with applications such as illumination transfer and background changes in animated portraits, and potential impact for AR/VR, social media, and film production.

Abstract

Face reenactment and portrait relighting are essential tasks in portrait editing, yet they are typically addressed independently, without much synergy. Most face reenactment methods prioritize motion control and multiview consistency, while portrait relighting focuses on adjusting shading effects. To take advantage of both geometric consistency and illumination awareness, we introduce Total-Editing, a unified portrait editing framework that enables precise control over appearance, motion, and lighting. Specifically, we design a neural radiance field decoder with intrinsic decomposition capabilities. This allows seamless integration of lighting information from portrait images or HDR environment maps into synthesized portraits. We also incorporate a moving least squares based deformation field to enhance the spatiotemporal coherence of avatar motion and shading effects. With these innovations, our unified framework significantly improves the quality and realism of portrait editing results. Further, the multi-source nature of Total-Editing supports more flexible applications, such as illumination transfer from one portrait to another, or portrait animation with customized backgrounds.

Paper Structure

This paper contains 20 sections, 22 equations, 18 figures, 3 tables.

Figures (18)

  • Figure 1: Face reenactment under uneven lighting. Existing models like deng2024portrait4dv2 couple facial textures and lighting, resulting in fixed light and shadow that do not adapt to head movements. In contrast, our model provides more realistic portrait shading.
  • Figure 2: The framework of Total-Editing. Appearance and motion: Our pipeline learns to encode appearance and motion sources $\mathbf{I}_\text{app},\mathbf{I}_\text{mot}$, neutralize the expression from $\mathbf{I}_\text{app}$ and reapply the expression from $\mathbf{I}_\text{mot}$ to obtain a fused feature $\mathbf{F}$. After generating canonical space geometry and shading tri-planes $\mathbf{T}_\text{geo},\mathbf{T}_\text{shd}$, the neck pose is handled by warping features with moving least squares based deformation fields $\mathcal{R},\mathcal{T}$. Illumination: For the lighting source, we either estimate from a portrait image $\mathbf{I}_\text{lit}$ or pre-filter (bake) an HDR environment map $\mathbf{I}_\text{HDR}$, resulting in diffuse and specular lightmaps $\mathbf{S}_\text{d},\{\mathbf{S}_\text{s}(n)\}$. Geometry and shading: With the lighting information, geometry and shading decoders $\mathcal{D}_\text{geo},\mathcal{D}_\text{shd}$ decode point-wise attributes. Finally, a neural renderer and a super-resolution module render the editing result $\hat{\mathbf{I}}$.
  • Figure 3: Deformation field comparison. (a) Similar to deng2024portrait4d and deng2024portrait4dv2, we derive the deformation field from FLAME meshes. Points and attached normals are sampled from the posed target space and then deformed into the unposed canonical space by deformation fields $\mathcal{T}$ and $\mathcal{R}$. (b) The Surface Field (SF) based approach of bergman2022generative assigns each grid to the nearest mesh triangle, leading to discontinuous deformation results. (c) In contrast, our moving least squares (MLS) based deformation field weighs per-point deformation with its inverse distance to all mesh vertices, producing smoother results.
  • Figure 4: Shading module architecture. The shading feature $\mathbf{t}_\text{shd}$ sampled from a position in the shading tri-plane is first decoded into normal $\mathbf{n}$, albedo $\mathbf{a}$, and additional features $\mathbf{w}$ for super-resolution. The viewing direction $\mathbf{v}$ is then reflected about the normal $\mathbf{n}$, resulting in a reflected viewing direction $\mathbf{r}$. $\mathbf{n}$ and $\mathbf{r}$ are concatenated with $\mathbf{t}_\text{shd}$ and mapped to a diffuse coefficient $k_\text{d}$ and specular coefficients $\{k_\text{s}(n)\}$ for various shininess values $\{n\}$, respectively. They are also used to sample a diffuse shading $\mathbf{s}_\text{d}$ and specular shadings $\{\mathbf{s}_\text{s}(n)\}$ from the corresponding lightmaps. These shadings, $\mathbf{s}_\text{d},\{\mathbf{s}_\text{s}(n)\}$, are concatenated with the albedo $\mathbf{a}$ to decode a residual color $\delta\mathbf{c}$. The final color $\mathbf{c}$ at this position is obtained by combining the physically based rendering (PBR) term and the neural residual $\delta\mathbf{c}$, as given in the paper's rendering equation.
  • Figure 5: Lightmap estimator architecture. During pre-training with synthetic data, we adopt a U-Net to encode the lighting source $\mathbf{I}_\text{lit}$ and decode pixel-wise normals $\mathbf{N}$. The intermediate feature $\mathbf{z}_\text{lit}$ is used to decode diffuse and specular lightmaps $\mathbf{S}_\text{d},\{\mathbf{S}_\text{s}(n)\}$, querying with embedded lobe features $\mathbf{z}_\text{d},\{\mathbf{z}_\text{s}(n)\}$. In joint training with Total-Editing, the normal decoder is detached.
  • ...and 13 more figures
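
Two geometric operations named in the captions of Figures 3 and 4 can be illustrated with a short Python sketch. It is not the authors' implementation. The function names `reflect_view` and `inverse_distance_blend`, the `power` and `eps` parameters, and the convention that the viewing direction points away from the surface are all assumptions made for illustration. The second function is plain inverse-distance (Shepard) blending of per-vertex offsets, a simplification that only shows the weighting idea described for Figure 3(c); a full moving least squares fit would additionally solve for a local transform per query point.

```python
import math


def reflect_view(v, n):
    """Reflect direction v about unit normal n: r = 2 (n . v) n - v.

    Assumes v points away from the surface (towards the camera)
    and that n has unit length.
    """
    d = sum(vi * ni for vi, ni in zip(v, n))
    return tuple(2.0 * d * ni - vi for vi, ni in zip(v, n))


def inverse_distance_blend(point, vertices, offsets, power=2.0, eps=1e-8):
    """Blend per-vertex offsets at `point` using inverse-distance weights.

    Every vertex contributes with weight 1 / (distance**power + eps),
    so the blended offset varies smoothly with the query point,
    unlike a nearest-triangle assignment.
    `power` and `eps` are hypothetical parameters for this sketch.
    """
    weights = [
        1.0 / (math.dist(point, vert) ** power + eps) for vert in vertices
    ]
    total = sum(weights)
    return tuple(
        sum(w * off[k] for w, off in zip(weights, offsets)) / total
        for k in range(len(point))
    )
```

For example, `reflect_view((0, 0, 1), (0, 0, 1))` leaves a direction aligned with the normal unchanged, and a query point equidistant from two vertices receives the average of their two offsets.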