G3R: Gradient Guided Generalizable Reconstruction

Yun Chen, Jingkang Wang, Ze Yang, Sivabalan Manivasagam, Raquel Urtasun

TL;DR

G3R learns a reconstruction network that uses gradient feedback from differentiable rendering to iteratively update a 3D scene representation, combining the high photorealism of per-scene optimization with the data-driven priors of fast feed-forward prediction methods.

Abstract

Large-scale 3D scene reconstruction is important for applications such as virtual reality and simulation. Existing neural rendering approaches (e.g., NeRF, 3DGS) have achieved realistic reconstructions on large scenes, but they optimize per scene, which is expensive and slow, and they exhibit noticeable artifacts under large view changes due to overfitting. Generalizable approaches and large reconstruction models are fast, but they primarily work for small scenes or objects and often produce lower-quality rendering results. In this work, we introduce G3R, a generalizable reconstruction approach that can efficiently predict high-quality 3D scene representations for large scenes. We propose to learn a reconstruction network that takes the gradient feedback signals from differentiable rendering to iteratively update a 3D scene representation, combining the benefits of high photorealism from per-scene optimization with data-driven priors from fast feed-forward prediction methods. Experiments on urban-driving and drone datasets show that G3R generalizes across diverse large scenes and accelerates the reconstruction process by at least 10x while achieving comparable or better realism than 3DGS and being more robust to large view changes.

Paper Structure

This paper contains 57 sections, 5 equations, 21 figures, 7 tables, 4 algorithms.

Figures (21)

  • Figure 1: Gradient Guided Generalizable Reconstruction (G3R): Our method learns a single reconstruction network that takes multi-view camera images and an initial point set to predict the 3D representation for large scenes ($> 10{,}000\ \mathrm{m}^2$) in two minutes or less, enabling realistic and real-time camera simulation.
  • Figure 2: Three paradigms for scene reconstruction and novel view synthesis (NVS). (a) Existing generalizable approaches select a few reference images (usually $\le 5$) for feed-forward prediction of an intermediate representation and then decode/render that feature representation to produce the rendered images. (b) Per-scene optimization approaches take all source images (e.g., $> 100$ for large scenes) and reconstruct a 3D representation via energy minimization and differentiable rendering. (c) G3R conducts iterative prediction to refine the 3D representation with 3D gradient guidance (i.e., learned optimization), taking all source images. Compared to the other two paradigms, G3R combines the benefits of both (data-driven priors, gradient feedback) and achieves the best trade-off between reconstruction quality and time (rightmost).
  • Figure 3: Method overview. We model generalizable reconstruction as an iterative process, where the 3D neural Gaussians $\mathcal{S}^{(t)}$ are iteratively refined with the reconstruction network $G_\theta$. We first lift the source 2D images ${\mathbf{I}}^{\mathrm{src}}$ to 3D space by backpropagating through the rendering procedure to get the gradients w.r.t. the representation, $\nabla_{\mathcal{S}^{(t)}}$ (blue arrow). Then the reconstruction network $G_\theta$ takes the 3D representation $\mathcal{S}^{(t)}$, the gradient $\nabla_{\mathcal{S}^{(t)}}$, and the iteration step $t$ as input, and predicts an updated 3D representation $\mathcal{S}^{(t+1)}$. To train the network, we render $\mathcal{S}^{(t+1)}$ at source and novel views and compute the loss. The backward gradient flow for training $G_\theta$ is highlighted with dashed blue arrows.
  • Figure 4: Qualitative comparison to generalizable approaches on PandaSet.
  • Figure 5: Qualitative comparison to generalizable approaches on BlendedMVS.
  • ...and 16 more figures
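
The iterative refinement described in the Figure 3 caption (render, take the gradient with respect to the representation, let a network predict the next representation) can be illustrated with a toy sketch. This is not the authors' implementation: the "scene" here is a short list of floats instead of 3D neural Gaussians, `render` is a hypothetical linear renderer, and `update_network` is a hand-written stand-in for $G_\theta$ whose per-step scales (`step_sizes`) are fixed by hand, whereas the paper learns the network from data. All function and parameter names are assumptions made for illustration.

```python
# Toy sketch of gradient-guided iterative refinement (illustrative only).
# Assumptions: scene = list of floats, renderer = dot product with a "view",
# update_network = hand-written stand-in for the learned network G_theta.


def render(scene, view):
    """Hypothetical differentiable renderer: one pixel = dot(view, scene)."""
    return sum(w * s for w, s in zip(view, scene))


def photometric_error(scene, views, targets):
    """Mean squared error between rendered and target pixels."""
    return sum((render(scene, v) - t) ** 2 for v, t in zip(views, targets)) / len(views)


def render_gradient(scene, views, targets):
    """Analytic gradient of photometric_error w.r.t. the scene parameters."""
    grad = [0.0] * len(scene)
    for view, target in zip(views, targets):
        residual = render(scene, view) - target
        for i, w in enumerate(view):
            grad[i] += 2.0 * residual * w / len(views)
    return grad


def update_network(scene, grad, step, step_sizes):
    """Stand-in for G_theta: maps (S_t, gradient, t) to S_{t+1}.

    Here it only scales the gradient by a per-step factor; a learned
    network would replace this rule.
    """
    return [s - step_sizes[step] * g for s, g in zip(scene, grad)]


def reconstruct(views, targets, init_scene, step_sizes):
    """Run one refinement step per entry in step_sizes."""
    scene = list(init_scene)
    for step in range(len(step_sizes)):
        grad = render_gradient(scene, views, targets)              # gradient feedback
        scene = update_network(scene, grad, step, step_sizes)      # predicted update
    return scene


# Usage: recover a two-parameter scene from three "views".
views = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
true_scene = [1.0, -2.0]
targets = [render(true_scene, v) for v in views]
init_scene = [0.0, 0.0]

refined = reconstruct(views, targets, init_scene, step_sizes=[0.5] * 20)
print("error before:", photometric_error(init_scene, views, targets))
print("error after: ", photometric_error(refined, views, targets))
```

With fixed step sizes this reduces to plain gradient descent, i.e. the per-scene optimization paradigm in Figure 2(b). The point of the sketch is the interface: replacing `update_network` with a model trained across many scenes is what would turn the loop into learned optimization in the sense of Figure 2(c).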