
Sparse 3D Reconstruction via Object-Centric Ray Sampling

Llukman Cerkezi, Paolo Favaro

TL;DR

This work tackles sparse-view 3D reconstruction from a 360° camera rig by introducing an object-centric ray sampling scheme paired with a hybrid implicit–mesh surface representation. By sampling along mesh-normal rays and sharing the same samples across all views, the method concentrates updates on consistent surface points, reducing the overfitting common in view-centric NeRF-style approaches. A dual-network formulation (an ISNN for shape density and a texture network, TNN, for color) drives a differentiable rendering pipeline, with a background model and Laplacian regularization improving robustness. The approach achieves state-of-the-art results under sparse views on the Google Scanned Objects, Tanks and Temples, and MVMC Car datasets, and it works without explicit segmentation masks. The combination of object-centric sampling, a flexible surface representation, and background handling offers practical gains for 3D reconstruction in real-world, partially occluded scenes.

Abstract

We propose a novel method for 3D object reconstruction from a sparse set of views captured from a 360-degree calibrated camera rig. We represent the object surface through a hybrid model that uses both an MLP-based neural representation and a triangle mesh. A key contribution of our work is a novel object-centric sampling scheme of the neural representation, where rays are shared among all views. This efficiently concentrates and reduces the number of samples used to update the neural model at each iteration. The sampling scheme also relies on the mesh representation to ensure that samples are well distributed along its normals. The rendering is then performed efficiently by a differentiable renderer. We demonstrate that this sampling scheme results in more effective training of the neural representation, does not require the additional supervision of segmentation masks, yields state-of-the-art 3D reconstructions, and works with sparse views on the Google Scanned Objects, Tanks and Temples, and MVMC Car datasets. Code available at: https://github.com/llukmancerkezi/ROSTER
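
As a rough illustration of the object-centric sampling idea described in the abstract, the sketch below places a fixed number of points along each vertex normal and reuses them for every view. It is not taken from the authors' code: the function names, the symmetric uniform spacing, and the `half_extent` parameter are assumptions made for the example. Only the figure of 8 samples per vertex comes from the Figure 3 caption.

```python
import numpy as np

def object_centric_samples(vertices, normals, num_samples=8, half_extent=0.05):
    """Place `num_samples` points along each vertex normal.

    vertices: (V, 3) mesh vertex positions.
    normals:  (V, 3) vertex normals (re-normalized here for safety).
    Returns an array of shape (V, num_samples, 3).
    """
    vertices = np.asarray(vertices, dtype=float)
    normals = np.asarray(normals, dtype=float)
    normals = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    # Signed offsets along the normal. The symmetric, uniform spacing and
    # the value of `half_extent` are assumptions of this sketch.
    offsets = np.linspace(-half_extent, half_extent, num_samples)
    return vertices[:, None, :] + offsets[None, :, None] * normals[:, None, :]

def samples_per_vertex(num_views, num_samples=8, object_centric=True):
    """Sample count per vertex: shared across views, or repeated per view."""
    return num_samples if object_centric else num_samples * num_views

# Two toy vertices; the first normal is deliberately not unit length.
verts = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
norms = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 0.0]])
pts = object_centric_samples(verts, norms)
print(pts.shape)  # (2, 8, 3)
# With K = 8 views: view-centric vs. object-centric samples per vertex.
print(samples_per_vertex(8, object_centric=False), samples_per_vertex(8))  # 64 8
```

Because the samples depend only on the mesh and not on any camera, the same array can be evaluated once per iteration and contributes to the loss of every rendered view.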

Paper Structure (19 sections, 2 equations, 27 figures, 7 tables, 1 algorithm)


Figures (27)

  • Figure 1: Left: Sparse view setting of a $360^{\circ}$ camera rig with 8 views. Right: 3D reconstructions with existing SotA methods. Due to the sparsity and wide spacing of the camera views, methods such as NeRS [NERS_2021_Neurips] and RegNeRF [regnerf_2021] reconstruct surfaces with visible artifacts. COLMAP$^*$ [schoenberger2016mvs, Schonberger_2016_CVPR] returned a valid mesh only with 50 views (so it is used only as a reference). Methods such as DS [diff_stereopsis_2021_arxiv] obtain better reconstructions, but with fewer details than our approach. Most methods make use of masks to segment the object in each view. In contrast, our method can work without this additional supervision and still obtain accurate 3D reconstructions (compare to the GT).
  • Figure 2: Sampling schemes. Left: NeRF view-centric sampling scheme. Right: Our object-centric sampling scheme. The view-centric sampling scheme uses separate sets of 3D samples for each camera view. This leads to overfitting when views are sparse. Object-centric sampling instead shares the same 3D samples across multiple views.
  • Figure 3: View-centric vs. object-centric sampling (see Figure \ref{fig:sampling}). Computational efficiency: with $K$ camera views, the view-centric approach uses $8\times K$ samples per mesh vertex. In contrast, the object-centric approach uses only $8$ samples per vertex regardless of the number of camera views. Object-centric sampling is not only more efficient but also avoids overfitting. For more details, see Section \ref{sec:method}.
  • Figure 4: Detailed model representation. We feed the object-centric points to the ISNN and obtain a density value. Then, we update the vertex location via Eq. \ref{eq:calc_vert_loc} using the points sampled along the vertex normal. We repeat this operation for all vertices to get the updated mesh surface.
  • Figure 5: We assign a color to each vertex of the mesh by querying the TNN model at that vertex. Then, we feed the textured mesh and a camera viewpoint as input to a differentiable renderer to synthesize a view of the scene. The reconstruction task is based on minimizing the difference between the synthesized view and a captured image (with the same viewpoint) in both $L_1$ and perceptual norms.
  • ...and 22 more figures
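
The Figure 4 caption describes moving each vertex based on densities predicted at points sampled along its normal. The paper's actual update equation is not reproduced on this page, so the sketch below shows just one plausible reading, and only as an assumption: the new vertex is the density-weighted average of its samples. The function name `update_vertices` and the small `eps` stabilizer are hypothetical, and the densities are supplied by hand rather than by a trained ISNN.

```python
import numpy as np

def update_vertices(samples, densities, eps=1e-8):
    """Density-weighted average of the samples along each vertex normal.

    samples:   (V, S, 3) points sampled along each vertex normal.
    densities: (V, S) non-negative density values, one per sample.
    Returns the updated vertex positions with shape (V, 3).

    NOTE: this weighting is an assumed stand-in, not the paper's equation.
    """
    samples = np.asarray(samples, dtype=float)
    densities = np.asarray(densities, dtype=float)
    # Normalize densities so the weights of each vertex sum to (almost) one.
    weights = densities / (densities.sum(axis=1, keepdims=True) + eps)
    return (weights[:, :, None] * samples).sum(axis=1)

# One vertex with three samples along the z axis and hand-picked densities.
samples = np.array([[[0.0, 0.0, 0.9], [0.0, 0.0, 1.0], [0.0, 0.0, 1.1]]])
densities = np.array([[0.0, 1.0, 3.0]])
new_vertices = update_vertices(samples, densities)
print(new_vertices)  # z is pulled toward the high-density sample, about 1.075
```

Under this assumed rule the operation is differentiable with respect to the densities, so an image-space loss computed on the rendered mesh can propagate back to the network that predicts them.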