
ID-NeRF: Indirect Diffusion-guided Neural Radiance Fields for Generalizable View Synthesis

Yaokun Li, Chao Gou, Guang Tan

TL;DR

This work tackles suboptimal reprojections in generalizable NeRFs under sparse inputs by introducing ID-NeRF, which injects pre-trained diffusion priors through an indirect latent-space pathway. A two-stage latent inference process distills knowledge from a frozen diffusion model into a view-level latent $z_{tv}$, which, together with reprojected cues $p_i$, is refined by an attention-based module before NeRF decoding. The approach avoids 3D inconsistencies common with direct diffusion supervision and demonstrates state-of-the-art performance on DTU and Real Forward-Facing, especially with few input views, albeit with slower training. The work advances practical, high-quality generalizable view synthesis in sparse settings by leveraging latent diffusion priors without per-scene optimization.

Abstract

Implicit neural representations, exemplified by Neural Radiance Fields (NeRF), have dominated research in 3D computer vision by virtue of their high-quality visual results and data-driven benefits. However, their real-world application is hindered by the need for dense inputs and per-scene optimization. To address this, previous methods implement generalizable NeRFs by extracting local features from sparse inputs as conditions for the NeRF decoder. Although this approach allows feed-forward reconstruction, it suffers from an inherent drawback: erroneous reprojected features yield sub-optimal results. In this paper, we focus on this problem and address it by introducing pre-trained generative priors to enable high-quality generalizable novel view synthesis. Specifically, we propose a novel Indirect Diffusion-guided NeRF framework, termed ID-NeRF, which leverages pre-trained diffusion priors as a guide for the reprojected features created by the previous paradigm. Notably, to enable 3D-consistent predictions, ID-NeRF discards the direct supervision commonly used in prior 3D generative models and instead adopts a novel indirect prior injection strategy. This strategy distills pre-trained knowledge into an imaginative latent space via score-based distillation; an attention-based refinement module then leverages the embedded priors to improve the reprojected features extracted from sparse inputs. We conduct extensive experiments on multiple datasets, and the results demonstrate the effectiveness of our method in synthesizing novel views in a generalizable manner, especially in sparse settings.
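
The attention-based refinement described in the abstract can be illustrated with a minimal NumPy sketch. It assumes single-head cross-attention in which each reprojected feature queries a set of latent tokens and adds the attended result back as a residual. The function name `refine_features`, the projection matrices, and all shapes are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def refine_features(reprojected, latent, w_q, w_k, w_v):
    """Hypothetical single-head cross-attention refinement.

    reprojected: (P, D) features gathered by geometric reprojection.
    latent:      (T, D) tokens from the diffusion-guided latent space.
    Returns the refined (P, D) features and the (P, T) attention weights.
    """
    q = reprojected @ w_q            # queries come from the reprojected features
    k = latent @ w_k                 # keys come from the latent tokens
    v = latent @ w_v                 # values come from the latent tokens
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)  # each row sums to 1 over latent tokens
    refined = reprojected + attn @ v # residual update of the reprojected features
    return refined, attn


# Toy shapes chosen only for illustration (assumptions, not paper settings).
rng = np.random.default_rng(0)
P, T, D, H = 6, 4, 8, 8
reprojected = rng.normal(size=(P, D))
latent = rng.normal(size=(T, D))
w_q = rng.normal(size=(D, H)) * 0.1
w_k = rng.normal(size=(D, H)) * 0.1
w_v = rng.normal(size=(D, D)) * 0.1

refined, attn = refine_features(reprojected, latent, w_q, w_k, w_v)
print(refined.shape, attn.shape)
```

The residual form keeps the geometric cue intact and lets the latent tokens contribute only a correction, which is one plausible way to read "refinement"; the actual module may differ.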

Paper Structure

This paper contains 16 sections, 10 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: (a) A scenario illustrating the reprojection principle of Generalizable NeRFs; (b) The inference process of existing Gen-NeRFs; (c) Gen-NeRF under direct guidance using the modeled distribution; (d) Our model that uses indirect guidance. In this model, the reprojected features are refined using a diffusion-guided latent space (purple patch). $\mathcal{L}_s$ and $\mathcal{L}_r$ are score-based distillation loss and reconstruction loss, respectively.
  • Figure 2: Overview of our ID-NeRF. Sparse input views are processed by two workflows. The first (red) uses geometric reprojection to obtain reprojected features (RF). The second (black) uses an inference module to predict the latent space $z_{tv}$, on which score-based distillation is performed against the PDM-predicted distribution $p(z_{tv}|\gamma)$. Both sets of features are then fed into the ARM to obtain the refined conditional feature $f_{c}$. $\{I^r_i\}_{i=1}^N$ and $I^r_{tv}$ are ray images used to enhance the pose information.
  • Figure 3: Qualitative comparison of rendering results. We present the rendered RGB images and depth maps of our ID-NeRF alongside those of the representative baselines MVSNeRF and MatchNeRF, with each result zoomed in on details.
  • Figure 4: Comparison of different guidance approaches on the DTU dataset.
  • Figure 5: Qualitative comparison of two supervision approaches. Both methods are trained on the DTU dataset with 3 input views.
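
For readers unfamiliar with the score-based distillation loss $\mathcal{L}_s$ named in Figure 1, the gradient below is the score distillation form popularized by DreamFusion, rewritten with this page's symbols. It is a reference sketch and not necessarily the paper's exact objective: the noise predictor $\epsilon_\phi$, the weighting $w(t)$, the noise schedule $\alpha_t, \sigma_t$, the noised latent $z_t$, and the inference-module parameters $\theta$ are assumed notation that does not appear in the text above.

$$
\nabla_\theta \mathcal{L}_s = \mathbb{E}_{t,\epsilon}\left[ w(t)\,\big(\epsilon_\phi(z_t; \gamma, t) - \epsilon\big)\,\frac{\partial z_{tv}}{\partial \theta} \right], \qquad z_t = \alpha_t z_{tv} + \sigma_t \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
$$

Under this reading, the frozen diffusion model only supplies $\epsilon_\phi$, and gradients flow into $\theta$ alone.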