Survey on Fundamental Deep Learning 3D Reconstruction Techniques

Yonge Bai; LikHang Wong; TszYin Twan

TL;DR

The survey analyzes three fundamental DL-driven 3D reconstruction paradigms (NeRFs, approaches based on latent diffusion models, and 3D Gaussian Splatting), detailing their scene representations, rendering pipelines, and optimization strategies. It highlights efficiency advances such as Instant-NGP's hash encoding and zero-shot, single-image view synthesis via Zero-1-to-3, while also outlining limitations in data requirements, generalizability, and editing. The work discusses practical tradeoffs between implicit and explicit representations and underscores future directions in semantic guidance, dynamic scenes, and single-view reconstruction. Together, these insights provide a cohesive roadmap for researchers and practitioners pursuing photo-realistic and efficient 3D reconstruction with DL methods.

Abstract

This survey investigates fundamental deep learning (DL) based 3D reconstruction techniques that produce photo-realistic 3D models and scenes, highlighting Neural Radiance Fields (NeRFs), Latent Diffusion Models (LDMs), and 3D Gaussian Splatting. We dissect the underlying algorithms, evaluate their strengths and tradeoffs, and project future research trajectories in this rapidly evolving field. We provide a comprehensive overview of the fundamentals of DL-driven 3D scene reconstruction, offering insights into the potential applications and limitations of these techniques.

Paper Structure (61 sections, 21 equations, 20 figures, 3 algorithms)

Abstract
Background
Neural Radiance Fields
Prior Work
Volume Rendering for View-Synthesis
Neural Networks as Shape Representations
Approach: NeRF
Neural Radiance Field Scene Representation
Volume Rendering with Radiance Fields
Optimizing a NeRF
Positional Encoding
Hierarchical Volume Sampling
Limitations
Computational Efficiency
Lack of Generalizability
...and 46 more sections

Figures (20)

Figure 1: An overview of the neural radiance field scene representation and differentiable rendering procedure. Images are synthesized by sampling 5D coordinates (location and viewing direction) along camera rays (a), feeding those locations into an MLP to predict a color and volume density (b), and using volume rendering techniques to composite these values into an image (c). This rendering function is differentiable, so the scene representation can be optimized by minimizing the residual between the synthesized colors and the ground-truth colors (d).
Figure 2: An overview of the NeRF model. $\mathbf{x}$ is passed into the first 8 layers, which output $\mathbf{v}$ and $\sigma$ (a). $\mathbf{v}$ is concatenated with $\mathbf{d}$ and passed into the last layer, which outputs $\mathbf{c}$ (b).
Figure 3: Visualizing how positional encoding improves the model. Without it, the model cannot represent high-variation geometry and texture, resulting in an over-smoothed, blurred appearance. The figure also shows how removing view dependence affects the model's ability to render lighting and reflections.
Figure 4: Illustrating hierarchical sampling, where samples are proportional to their contribution to the final volume render.
Figure 5: Comparison from the paper NeRF in the Wild (Martin-Brualla et al., 2021), where the original NeRF (left) exhibits noisy artifacts compared to NeRF-W (right).
...and 15 more figures
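
The pipeline described in the captions of Figures 1 and 3 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `positional_encoding` and `composite_ray`, the parameter `num_freqs`, and the large final interval width are assumptions made for illustration. The MLP from step (b) is omitted, so densities and colors are passed in directly. The encoding groups all sine terms before all cosine terms, which is a reordering of the usual interleaved layout.

```python
import numpy as np


def positional_encoding(p, num_freqs):
    """Encode each coordinate with sin/cos at frequencies 2^k * pi, k < num_freqs.

    p: array of shape (..., D). Returns an array of shape (..., D * 2 * num_freqs).
    Layout per coordinate: all sine terms first, then all cosine terms
    (an illustrative choice, not necessarily the ordering used by any paper).
    """
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi           # (num_freqs,)
    angles = p[..., None] * freqs                            # (..., D, num_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*p.shape[:-1], -1)


def composite_ray(sigmas, colors, t_vals):
    """Composite per-sample densities and colors along one ray into a pixel color.

    sigmas: (N,) volume densities, colors: (N, 3) RGB values,
    t_vals: (N,) increasing sample distances along the ray.
    Returns (rgb, weights), where weights[i] is the contribution of sample i.
    """
    # Distance between adjacent samples; the last interval is treated as
    # effectively unbounded (an assumption of this sketch).
    deltas = np.diff(t_vals, append=t_vals[-1] + 1e10)
    # Opacity of each interval.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: probability the ray reaches sample i without being absorbed.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights
```

Because every step is built from differentiable operations, the same computation written in an autodiff framework would let gradients of a color residual flow back to the predicted densities and colors, as step (d) of Figure 1 describes. The returned `weights` are also the quantities that the hierarchical sampling in Figure 4 would use to place additional samples.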
