
NPLMV-PS: Neural Point-Light Multi-View Photometric Stereo

Fotios Logothetis, Ignas Budvytis, Roberto Cipolla

TL;DR

This work tackles multi-view photometric stereo by explicitly leveraging per-pixel intensities through a physics-informed irradiance model, shadow ray tracing, and a fully neural material renderer. It jointly optimizes a neural SDF and a learnable BRDF, guided by rendering, silhouette, and normal losses, achieving approximately 0.2 mm Chamfer distance on dense DiLiGenT-MV data and substantially better performance than prior methods in sparse-view/light setups. The results demonstrate that incorporating pixel-level radiance information, in addition to normal estimates, yields superior shape recovery and robustness, with clear benefits in challenging configurations. This approach advances MVPS by moving beyond normal-based optimization toward an end-to-end neural rendering framework that respects light transport and material variability.

Abstract

In this work we present a novel multi-view photometric stereo (MVPS) method. Like many works in 3D reconstruction, we leverage neural shape representations and learnt renderers. However, our work differs from state-of-the-art multi-view PS methods such as PS-NeRF or Supernormal in that we explicitly leverage per-pixel intensity renderings rather than relying mainly on estimated normals. We model point-light attenuation and explicitly raytrace cast shadows in order to best approximate the incoming radiance at each point. The estimated incoming radiance is used as input to a fully neural material renderer that makes minimal prior assumptions and is jointly optimised with the surface. Estimated normals and segmentation maps are also incorporated in order to maximise surface accuracy. Our method is among the first (along with Supernormal) to outperform the classical MVPS approach proposed by the DiLiGenT-MV benchmark, and achieves an average Chamfer distance of 0.2mm for objects imaged at a distance of approximately 1.5m at approximately 400x400 resolution. Moreover, our method shows high robustness in the sparse MVPS setup (6 views, 6 lights), greatly outperforming the SOTA competitor (0.38mm vs 0.61mm), illustrating the importance of neural rendering in multi-view photometric stereo.
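
The point-light attenuation and cast-shadow steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an isotropic point light with plain inverse-square falloff, a binary shadow test obtained by sphere tracing, and an analytic sphere standing in for the neural SDF. The names `sphere_sdf`, `shadow_visibility` and `incoming_radiance`, as well as the step size and thresholds, are hypothetical.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(sum(x * x for x in a))

def sphere_sdf(p, radius=1.0):
    # Signed distance to a sphere at the origin (assumed stand-in for a neural SDF).
    return norm(p) - radius

def shadow_visibility(x, light_pos, sdf, eps=1e-3, max_steps=64):
    # Sphere-trace from x towards the light; 0.0 if the ray hits the surface, else 1.0.
    to_light = sub(light_pos, x)
    dist = norm(to_light)
    d = tuple(c / dist for c in to_light)
    t = 10.0 * eps  # start slightly away from x to avoid self-intersection
    for _ in range(max_steps):
        if t >= dist:
            return 1.0  # reached the light without hitting anything
        p = tuple(xc + t * dc for xc, dc in zip(x, d))
        s = sdf(p)
        if s < eps:
            return 0.0  # occluder found between x and the light
        t += s  # safe step: the SDF value bounds the distance to the surface
    return 1.0  # step budget exhausted without a hit: treat as lit

def incoming_radiance(x, light_pos, phi, sdf):
    # Assumed model: brightness phi, inverse-square falloff, binary cast shadow.
    dist = norm(sub(light_pos, x))
    return phi / dist ** 2 * shadow_visibility(x, light_pos, sdf)

# Point on top of the sphere, light directly above it: unshadowed.
lit = incoming_radiance((0.0, 0.0, 1.0), (0.0, 0.0, 3.0), 4.0, sphere_sdf)
# Point behind the sphere: the sphere blocks the light.
shadowed = incoming_radiance((0.0, 0.0, -2.0), (0.0, 0.0, 3.0), 4.0, sphere_sdf)
```

In this sketch the cosine foreshortening term is deliberately left out, on the assumption that the surface response is handled by the material renderer that receives the incoming radiance.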

Paper Structure (18 sections, 5 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: This figure demonstrates the fragility of relying mainly on estimated normals (using \cite{hardyunips}) for deep-learning-based sparse multi-view photometric stereo, when 6 of the 20 available views and 6 of the 96 available light sources are used. The first column shows the ground truth of the Buddha object from the dense MVPS DiLiGenT-MV benchmark. The following two columns show the estimated normals and corresponding error maps using only 6 of the 96 available lights (the mean normal error for view 1 is $9.9^{\circ}$; saturated red corresponds to $5^{\circ}$ error). Using such normals leads to a large reconstruction error both for our method when pixel intensities are not leveraged (0.51mm) and for the previous SOTA dense MVPS method Supernormal \cite{cao2023supernormal} (0.67mm). If pixel intensities are used along with estimated normals (column 6), a significantly smaller error of 0.35mm is achieved. The final two columns show the error maps of the estimated shapes when all available views and lights are used. In this setting Supernormal \cite{cao2023supernormal} achieves a reconstruction error similar to that of our method (0.21mm vs 0.19mm). Similar dynamics apply to other DiLiGenT-MV objects, as shown in Tables \ref{tab:Tab_eval_diligent} and \ref{tab:Tab_eval_diligent_sparse}, strongly motivating explicit pixel intensity modeling in MVPS methods. Note that the errors are computed as Chamfer distance, while the visualisation only shows the error from each reconstructed mesh surface point to the ground truth mesh. Dark red corresponds to $\ge 1$mm error in the shape error illustrations (columns 3-8).
  • Figure 2: Schematic of our overall method. Single-view PS is used to obtain normal maps. Training the SDF with normal and silhouette losses (for 3 epochs only, see Section \ref{sec:init}) yields a rough surface, which is then refined with full volumetric rendering, explained in Figure \ref{fig:diagram}. The second row also shows the GT and rendered images (as grayscale), the rendering error (with red $\ge 0.1$) as well as the computed shadow map.
  • Figure 3: Visualisation of our volume rendering approach. Two rays with multiple ray samples $\mathbf{X}_{ri}$ and $\mathbf{Y}_{ri}$ are shown. The surface-ray intersection points $\mathbf{X}_I$ and $\mathbf{Y}_I$ are also shown, as they are used to ray trace cast shadows (towards the light source at position $\mathbf{P}$ with brightness $\phi$). Cast shadow samples are marked as $\mathbf{X}_{si}$ and $\mathbf{Y}_{si}$ respectively. Points that contribute significantly to the total rendering (through the accumulated opacity) are coloured blue, and points that do not (because they are outside of the surface or occluded) are marked red. Shadow sample points are not rendered and so are marked black. The intersection points ($\mathbf{X}_I$ and $\mathbf{Y}_I$) are only used to guide shadows, so they are not rendered either. Finally, for the ray sample point $\mathbf{X}_{r2}$, the normal $\mathbf{N}$, lighting vector $\mathbf{L}$ and viewing vector $\mathbf{V}$ (which are used for rendering) are shown in red, green and blue respectively.
  • Figure 4: Qualitative visualisation of re-rendering and rendering errors for synthetic and real data (left side, top and bottom 3 rows). The scaling of the error map sets red to $\ge 0.1$. Most of the error is concentrated in the middle of concavities, as self-reflection is not modeled. The right side shows renderings from a novel angle of objects recovered from real data, as well as the recovered albedo maps.
  • Figure 5: Qualitative results on the real DiLiGenT-MV benchmark \cite{LiZWSDT20}. For each mesh vertex, the minimum distance to the GT mesh is shown, with red on the error bar corresponding to 1mm. We achieve consistent, uniform accuracy on all regions of all objects, including the concavity in the middle of Reading.
  • ...and 3 more figures
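
The captions distinguish between the Chamfer distance used for the reported numbers and the one-directional reconstruction-to-GT error shown in the error maps. The sketch below illustrates that difference on toy point sets. It assumes one common convention (the average of the two directional mean nearest-neighbour distances, using unsquared Euclidean distance); the benchmark's exact evaluation protocol may differ, and the function names and toy points are hypothetical.

```python
import math

def one_sided(src, dst):
    # Mean distance from each point in src to its nearest neighbour in dst.
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

def chamfer(a, b):
    # Symmetric Chamfer distance under the assumed averaging convention.
    return 0.5 * (one_sided(a, b) + one_sided(b, a))

# Toy example: the reconstruction misses one ground-truth point.
recon = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]

recon_to_gt = one_sided(recon, gt)   # what a per-vertex error map would show
symmetric = chamfer(recon, gt)       # also penalises the missing region
```

Here every reconstructed point lies exactly on the ground truth, so the one-directional error is zero, while the symmetric distance is non-zero because part of the ground truth is not covered by the reconstruction.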