Multi-scale Attention-Guided Intrinsic Decomposition and Rendering Pass Prediction for Facial Images

Hossein Javidnia

TL;DR

This work tackles the ill-posed problem of single-image intrinsic decomposition for faces under unconstrained lighting by introducing MAGINet, a multi-scale attention-guided network that first estimates a light-normalized diffuse albedo at $512\times512$. RefinementNet then refines the albedo at $1024\times1024$, and a Pix2PixHD-based translator conditioned on the refined albedo predicts the remaining passes of a six-pass PBR rendering stack. The three-stage pipeline yields state-of-the-art albedo accuracy and markedly improves the fidelity of the full rendering stack, enabling high-quality relighting and material editing of real faces. The approach leverages a dual-attention bottleneck, adaptive feature fusion, and carefully designed losses, is trained on FFHQ-UV-Intrinsics with synthetic supervision, and is validated through extensive quantitative and synthetic-data experiments. The resulting six passes (albedo A, surface normal N, specular reflectance S, translucency T, raw diffuse colour D, and ambient occlusion AO) provide a practical, editable set of assets for professional rendering pipelines, bridging the gap between learning-based relighting and traditional inverse rendering.
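
The data flow of the three stages can be sketched as follows. This is only an illustration of how the resolutions and the six passes fit together: every network is replaced by a placeholder that copies its input, the images are single-channel nested lists, and the nearest-neighbour upsampling is an assumption, since the interpolation method is not stated here. All function and pass names are hypothetical.

```python
from typing import Dict, List

# Single-channel image as rows of floats (illustrative stand-in for a tensor).
Image = List[List[float]]

# The five passes predicted by the Stage III translator (names are hypothetical).
PASS_NAMES = ("ambient_occlusion", "normal", "specular", "translucency", "raw_diffuse")


def copy_image(img: Image) -> Image:
    return [row[:] for row in img]


def maginet_stub(rgb: Image) -> Image:
    """Placeholder for Stage I (MAGINet): would predict the coarse albedo."""
    return copy_image(rgb)


def upsample_2x(img: Image) -> Image:
    """Nearest-neighbour 2x upsampling (assumed; e.g. 512x512 -> 1024x1024)."""
    out: Image = []
    for row in img:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(wide[:])
    return out


def refinement_stub(albedo: Image) -> Image:
    """Placeholder for Stage II (RefinementNet): would sharpen the albedo."""
    return copy_image(albedo)


def translator_stub(albedo: Image) -> Dict[str, Image]:
    """Placeholder for Stage III: would predict five passes from the albedo."""
    return {name: copy_image(albedo) for name in PASS_NAMES}


def decompose(rgb: Image) -> Dict[str, Image]:
    """Run the three stages in order and collect all six passes."""
    coarse = maginet_stub(rgb)
    refined = refinement_stub(upsample_2x(coarse))
    passes = translator_stub(refined)
    passes["albedo"] = refined
    return passes
```

Calling `decompose` on a 4x4 toy image returns a dictionary of six 8x8 maps, mirroring the step from $512\times512$ to $1024\times1024$ in the actual pipeline.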

Abstract

Accurate intrinsic decomposition of face images under unconstrained lighting is a prerequisite for photorealistic relighting, high-fidelity digital doubles, and augmented-reality effects. This paper introduces MAGINet, a Multi-scale Attention-Guided Intrinsics Network that predicts a $512\times512$ light-normalized diffuse albedo map from a single RGB portrait. MAGINet employs hierarchical residual encoding, spatial-and-channel attention in a bottleneck, and adaptive multi-scale feature fusion in the decoder, yielding sharper albedo boundaries and stronger lighting invariance than prior U-Net variants. The initial albedo prediction is upsampled to $1024\times1024$ and refined by a lightweight three-layer CNN (RefinementNet). Conditioned on this refined albedo, a Pix2PixHD-based translator then predicts a comprehensive set of five additional physically based rendering passes: ambient occlusion, surface normal, specular reflectance, translucency, and raw diffuse colour (with residual lighting). Together with the refined albedo, these six passes form the complete intrinsic decomposition. Trained with a combination of masked-MSE, VGG, edge, and patch-LPIPS losses on the FFHQ-UV-Intrinsics dataset, the full pipeline achieves state-of-the-art performance for diffuse albedo estimation and demonstrates significantly improved fidelity for the complete rendering stack compared to prior methods. The resulting passes enable high-quality relighting and material editing of real faces.
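
Of the training losses listed above, the masked MSE is the simplest to illustrate. The exact formulation is not given in this summary, so the sketch below assumes the common definition: the squared error averaged over only those pixels where a binary mask is non-zero. The function name and the nested-list image layout are hypothetical, and plain Python is used for clarity where a real implementation would operate on tensors.

```python
from typing import Sequence


def masked_mse(
    pred: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
    mask: Sequence[Sequence[int]],
) -> float:
    """Mean squared error over pixels whose mask value is non-zero.

    Assumed definition; returns 0.0 when the mask selects no pixels.
    """
    total = 0.0
    count = 0
    for pred_row, target_row, mask_row in zip(pred, target, mask):
        for p, t, m in zip(pred_row, target_row, mask_row):
            if m:
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0


# Only the left column is inside the mask, so the large error at
# position (0, 1) does not contribute to the loss.
pred = [[1.0, 5.0], [2.0, 0.0]]
target = [[0.0, 0.0], [0.0, 0.0]]
mask = [[1, 0], [1, 0]]
loss = masked_mse(pred, target, mask)  # (1^2 + 2^2) / 2 = 2.5
```

Restricting the average to masked pixels keeps background and other excluded regions from dominating the loss.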

Paper Structure

This paper contains 54 sections, 11 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Overview of the proposed multi-stage neural rendering pipeline. Stage I (MAGINet) provides initial intrinsic estimates, Stage II (RefinementNet) enhances fine details, and Stage III (Pix2PixHD [wang2018Pix2PixHD]) synthesizes high-quality intrinsic image decomposition outputs.
  • Figure 2: Averaged per-pixel normal-map error (°) over 100 synthetic faces. Low-frequency regions ($<3^\circ$) appear black; errors concentrate along high-curvature ridges (nostrils, eyelids, ears) and masked interior cavities, explaining the long-tail statistics in Table \ref{tab:normals_stats}.
  • Figure 3: Qualitative results on the synthetic scan set. For three identities we show (left → right): the original Cycles render that serves as network input, a re-render produced from our predicted intrinsic passes, and a re-render from ground-truth passes. Despite being reconstructed from a single RGB image, our output faithfully reproduces skin tone, pore-level detail and global shading; residual differences are confined to high-curvature regions such as the ear rim and eyelid crease.
  • Figure 4: Comparison of diffuse albedo predictions. From left to right: Input image, Ground Truth (GT), proposed method (MAGINet + RefinementNet), U-Net-6L, SfSNet [sengupta2018sfsnet], InverseFaceNet [kim2018inversefacenet], and GAN2X [pan2022gan2x].
  • Figure 5: Relighting results using the intrinsic decomposition obtained by our proposed method under varying illumination conditions. Columns (left → right) show original inputs alongside synthesized avatars relit under daylight and night lighting scenarios.
  • ...and 1 more figure
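
The per-pixel normal-map error in degrees reported in the Figure 2 caption is typically computed as the angle between the predicted and ground-truth normal vectors. The sketch below shows that standard angular-error computation; whether the paper uses exactly this formula is an assumption, and the function names and the nested-list layout are hypothetical.

```python
import math
from typing import Sequence

Vec3 = Sequence[float]


def angular_error_deg(n_pred: Vec3, n_gt: Vec3) -> float:
    """Angle in degrees between two normals (normalized internally)."""
    dot = sum(a * b for a, b in zip(n_pred, n_gt))
    norm = math.sqrt(sum(a * a for a in n_pred)) * math.sqrt(sum(b * b for b in n_gt))
    if norm == 0.0:
        raise ValueError("normal vectors must be non-zero")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


def mean_angular_error_deg(
    pred_map: Sequence[Sequence[Vec3]],
    gt_map: Sequence[Sequence[Vec3]],
    mask: Sequence[Sequence[int]],
) -> float:
    """Average angular error over pixels whose mask value is non-zero."""
    errors = [
        angular_error_deg(p, g)
        for pred_row, gt_row, mask_row in zip(pred_map, gt_map, mask)
        for p, g, m in zip(pred_row, gt_row, mask_row)
        if m
    ]
    return sum(errors) / len(errors) if errors else 0.0


# A 1x2 toy normal map: the first pixel matches the ground truth exactly,
# the second is tilted by 45 degrees.
pred_map = [[(0.0, 0.0, 1.0), (0.0, 1.0, 1.0)]]
gt_map = [[(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]]
mean_err = mean_angular_error_deg(pred_map, gt_map, [[1, 1]])  # about 22.5
```

Averaging such per-pixel errors across many faces, rather than across one image, would give an error map of the kind the caption describes.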