Incorporating dense metric depth into neural 3D representations for view synthesis and relighting

Arkadeep Narayan Chaudhury, Igor Vasiljevic, Sergey Zakharov, Vitor Guizilini, Rares Ambrus, Srinivasa Narasimhan, Christopher G. Atkeson

TL;DR

This work tackles photo-realistic capture of small scenes in robotics by integrating dense metric depth into neural 3D representations to improve geometry and appearance from a few views. It introduces a depth-informed intrinsic-appearance framework, augments several baselines with depth and depth-edge supervision, and proposes sampling strategies that distinguish texture edges from geometric (depth) edges. A robot-mounted multi-flash stereo system collects depth, depth edges, and multi-illumination data to enable accurate view synthesis and relighting, even with limited training views. The approach achieves state-of-the-art geometry and plausible relighting, identifies limitations with transparent and highly reflective materials, and outlines a practical hardware-software stack for scalable small-scene capture.

Abstract

Synthesizing accurate geometry and photo-realistic appearance of small scenes is an active area of research with compelling use cases in gaming, virtual reality, robotic manipulation, autonomous driving, convenient product capture, and consumer-level photography. When applying scene geometry and appearance estimation techniques to robotics, we found that the narrow cone of possible viewpoints, caused by the limited range of robot motion and by scene clutter, led current estimation techniques to produce poor-quality estimates or even fail. On the other hand, in robotic applications, dense metric depth can often be measured directly using stereo, and illumination can be controlled. Depth can provide a good initial estimate of the object geometry to improve reconstruction, while multi-illumination images can facilitate relighting. In this work, we demonstrate a method to incorporate dense metric depth into the training of neural 3D representations, and we address an artifact observed while jointly refining geometry and appearance by disambiguating between texture and geometry edges. We also discuss a multi-flash stereo camera system developed to capture the necessary data for our pipeline, and we show results on relighting and view synthesis with a few training views.

Paper Structure (27 sections, 14 equations, 13 figures, 7 tables)

Figures (13)

  • Figure 1: We present an approach for the photo-realistic capture of small scenes by incorporating dense metric depth, multi-view, and multi-illumination images into neural 3D scene understanding pipelines. We use a robot-mounted multi-flash stereo camera system, developed in-house, to capture the supervision signals needed to optimize our representation with a few input views. The reconstructions of the LEGO plant and the face were generated with 11 and 2 stereo pairs, respectively. We relight the textured meshes using Blender. Background design by Ginibird.
  • Figure 2: A snapshot of the important supervision signals. We capture a high dynamic range image \cite{mertens2007exposure} and display it after tonemapping \cite{reinhard2002} in \ref{fig:mfc_main_RGB}. \ref{fig:mfc_stereo_depth} shows the scene depth (in mm) from stereo. \ref{fig:mfc_depth_edges} displays the likelihood of each pixel falling on a depth edge. \ref{fig:mfc_normals} shows the object surface normals. We note that, unlike conventional stereo matching \cite{hirschmuller2005accurate}, the method of \cite{xu2022gmflow} returns locally smooth surfaces and often ignores local texture variations, but is less noisy. The inset shows the surface normals on the textured aluminum plate, calculated as gradients of depth from conventional stereo matching. Finally, \ref{fig:mfc_specularity_labels} identifies the pixels with the largest appearance variation due to moving lights. We used the system in \ref{fig:MFC_schematic} to capture the data.
  • Figure 3: Overview of our sampling process during training. \ref{fig:lego_plant_gt} is the ground truth test image. \ref{fig:lego_plant_10_pct} is the reconstruction of the test image after training has progressed 15% (15k gradient steps), and \ref{fig:lego_plant_100_pct} is the reconstruction of the test image at the end of training (100k gradient steps). \ref{fig:lego_plant_edges} denotes the per-pixel likelihoods of depth edges in the scene at the same view, captured with our device. We note that in \ref{fig:lego_plant_10_pct}, the parts of the scene with complicated geometry (foliage with many depth edges) have lower appearance fidelity at an earlier stage of training, which gradually improves in \ref{fig:lego_plant_100_pct}. \ref{fig:lego_plant_sampling_10} indicates the per-pixel sampling likelihood if the test view were to be used for training at a training progress of 10%, and \ref{fig:lego_plant_sampling_90} indicates the same at a progress of 90%. \ref{eq:edge_selection_probabililty} is used to draw the samples: $\alpha = 0.1$ and $0.9$ respectively for \ref{fig:lego_plant_sampling_90} and \ref{fig:lego_plant_sampling_10}. Brighter color indicates higher sampling likelihood.
  • Figure 4: We demonstrate a corner case of jointly refining appearance and geometry. The left insets of \ref{fig:unisurf_checker_horizontal} and \ref{fig:unisurf_checker_vertical} are the scene geometries recovered in the worst cases; the right insets display the better meshes recovered using the method described in \ref{sc:appearance_and_shape}. An image used for training and the edge map used for sampling are in the insets. We recommend zooming into the figure for details. Corresponding quantitative results are in \ref{tab:jointly_refining_shape_appearance}.
  • Figure 5: A multi-flash stereo camera to image small scenes.
  • ...and 8 more figures
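
The Figure 3 caption describes drawing training pixels from a per-pixel sampling likelihood that depends on the depth-edge map and a parameter $\alpha$. The equation itself is not reproduced on this page, so the following is only a minimal sketch under an assumed form: a convex mixture of a uniform distribution and the normalized edge likelihood. The function names, the mixture form, and the direction in which `alpha` weights the edges are all illustrative assumptions, not the authors' definition.

```python
import numpy as np


def pixel_sampling_probs(edge_likelihood, alpha):
    """Per-pixel sampling probabilities (ASSUMED form, not the paper's equation).

    alpha = 0 gives uniform sampling over the image; alpha = 1 samples
    pixels in proportion to their depth-edge likelihood.
    """
    edge = np.asarray(edge_likelihood, dtype=np.float64)
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    uniform = np.full(edge.shape, 1.0 / edge.size)
    total = edge.sum()
    # Fall back to uniform sampling if the edge map is empty.
    edge_dist = edge / total if total > 0 else uniform
    return alpha * edge_dist + (1.0 - alpha) * uniform


def sample_pixels(edge_likelihood, alpha, n, rng):
    """Draw n (row, col) pixel coordinates from the mixture distribution."""
    probs = pixel_sampling_probs(edge_likelihood, alpha)
    flat = rng.choice(probs.size, size=n, replace=True, p=probs.ravel())
    rows, cols = np.unravel_index(flat, probs.shape)
    return np.stack([rows, cols], axis=1)
```

A training loop could call `sample_pixels(edge_map, alpha, batch_size, np.random.default_rng(seed))` once per step and vary `alpha` with training progress; how `alpha` should be scheduled is left to the paper's own equation.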