HG3-NeRF: Hierarchical Geometric, Semantic, and Photometric Guided Neural Radiance Fields for Sparse View Inputs

Zelin Gao, Weichen Dai, Yu Zhang

TL;DR

HG3-NeRF tackles sparse-view novel view synthesis by guiding NeRF with hierarchical geometric, semantic, and photometric cues. It introduces HGG to leverage sparse depth priors from SfM through local-to-global sampling, HSG to supervise coarse-to-fine semantics via CLIP, and HPG to ensure appearance consistency across scales. The approach delivers state-of-the-art results on standard benchmarks with sparse inputs and demonstrates improved geometry and semantics in real-world space without relying on NDC representations. Ablation studies confirm the contributions of HGG and HSG, and the method reduces data requirements compared to traditional NeRF approaches.

Abstract

Neural Radiance Fields (NeRF) have garnered considerable attention as a paradigm for novel view synthesis by learning scene representations from discrete observations. Nevertheless, NeRF exhibits pronounced performance degradation when confronted with sparse view inputs, which limits its applicability. In this work, we introduce Hierarchical Geometric, Semantic, and Photometric Guided NeRF (HG3-NeRF), a novel methodology that addresses this limitation and enhances the consistency of geometry, semantic content, and appearance across different views. We propose Hierarchical Geometric Guidance (HGG) to incorporate a by-product of Structure from Motion (SfM), namely the sparse depth prior, into the scene representations. Unlike direct depth supervision, HGG samples volume points from local-to-global geometric regions, mitigating the misalignment caused by inherent bias in the depth prior. Furthermore, we draw inspiration from the notable variations in semantic consistency observed across images of different resolutions and propose Hierarchical Semantic Guidance (HSG) to learn coarse-to-fine semantic content, which corresponds to the coarse-to-fine scene representations. Experimental results demonstrate that HG3-NeRF outperforms other state-of-the-art methods on standard benchmarks and achieves high-fidelity synthesis results for sparse view inputs.
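
To make the local-to-global idea behind HGG concrete, the following is a minimal sketch of sampling depths along a single ray inside an interval that starts narrow around the sparse depth prior and widens to the full scene range. The abstract does not specify the widening rule, so the function name `sample_depths_local_to_global`, the `progress` parameter, the linear schedule, and the 5% minimum half-width are all illustrative assumptions rather than the authors' procedure.

```python
import numpy as np

def sample_depths_local_to_global(depth_prior, near, far, progress, n_samples, rng):
    """Stratified depth samples in an interval that widens with `progress`.

    progress = 0 -> narrow (local) interval around the depth prior
    progress = 1 -> the full (global) [near, far] range
    The linear widening schedule is an assumption made for illustration.
    """
    progress = float(np.clip(progress, 0.0, 1.0))

    # Interpolate the interval bounds from the prior towards near / far.
    lo = depth_prior + progress * (near - depth_prior)
    hi = depth_prior + progress * (far - depth_prior)

    # Keep a minimal local width so the interval never collapses to a point
    # (hypothetical choice: 5% of the scene depth range on each side).
    min_half_width = 0.05 * (far - near)
    lo = max(near, min(lo, depth_prior - min_half_width))
    hi = min(far, max(hi, depth_prior + min_half_width))

    # Stratified sampling: one uniform draw inside each of n_samples bins.
    edges = np.linspace(lo, hi, n_samples + 1)
    u = rng.random(n_samples)
    return edges[:-1] + u * (edges[1:] - edges[:-1])


rng = np.random.default_rng(0)
local_samples = sample_depths_local_to_global(4.0, 2.0, 6.0, 0.0, 16, rng)
global_samples = sample_depths_local_to_global(4.0, 2.0, 6.0, 1.0, 16, rng)
```

Under these assumptions, early samples stay close to the (possibly biased) prior, while later samples cover the whole ray, so the prior guides sampling instead of acting as a hard depth target.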

Paper Structure

This paper contains 14 sections, 11 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Novel View Synthesis Results from 3 View Inputs. Left: The bias in the sparse depth prior is caused by keypoint mismatching during the multi-view stereo process of SfM. Mid: Bias in the depth prior is introduced to NeRF through depth supervision and leads to geometric misalignment. Right: Our HG$^{3}$-NeRF leverages additional guidance to learn the scene representations, showing strong performance in both appearance and geometry.
  • Figure 2: Overview of HG$^{3}$-NeRF. The 3D volume location $\mathbf{x}_i$ is first sampled within the local-to-global region set up by hierarchical geometric guidance and then fed into neural radiance fields along with the viewing direction $\mathbf{d}$ to query color $\mathbf{c}_i$ and density $\sigma_{i}$. Via the volume rendering theorem, the query results are integrated into $\mathbf{\hat{C}}_{recon}$, which contains $\mathbf{\hat{C}}_{recon}^{c}$ for the coarse model and $\mathbf{\hat{C}}_{recon}^{f}$ for the fine model. Moreover, we employ CLIP to encode the image $\mathbf{\hat{C}}_{sem}$ rendered from a randomly selected pose as a feature vector $\varphi_{sem}$. The scene representations are finally optimized by the coarse-to-fine cosine similarity between $\varphi_{sem}$ and $\varphi_{i}$ from hierarchical semantic guidance as well as the MSE between $\mathbf{\hat{C}}_{recon}$ and the observed color $\mathbf{C}_{gt}$.
  • Figure 3: Bias from Multi-View Stereo. Classical stereo methods estimate depth through geometric constraints based on keypoint matching. The shift $\Delta{d}$ in ray direction is caused by keypoint mismatching and thus introduces bias into the depth estimation. The bias increases further when the translation between two frames is small.
  • Figure 4: Semantic Consistency Effects on Multi-Resolution Images. Computing the cosine similarity between the feature vectors of the original image and its downsampled versions shows that the similarity decreases sharply as the downsampling rate increases. The effect is stronger when the dataset consists of complex real-world scenes (e.g., the LLFF dataset [Mildenhall et al., 2019]), where each image contains plenty of content, which further limits the effectiveness of semantic consistency supervision.
  • Figure 5: Qualitative Results on DTU Dataset. HG$^3$-NeRF maintains more fine details in the synthesized images and sharper edges in the estimated depth maps.
  • ...and 2 more figures
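
The pipeline described in the Figure 2 caption can be sketched roughly as follows: the per-sample colors and densities queried along a ray are composited with the standard NeRF quadrature, the rendered color is compared to the observed color with an MSE, and a cosine-similarity term compares two feature vectors. This is a minimal sketch, not the authors' implementation: the feature vectors are plain arrays standing in for CLIP embeddings, and `total_loss` with its weight `lam` is a hypothetical way of combining the two terms.

```python
import numpy as np

def composite_color(sigmas, colors, t_vals):
    """Standard NeRF quadrature along one ray.

    sigmas: (N,) densities, colors: (N, 3) RGB, t_vals: (N,) sorted depths.
    Returns the rendered RGB color and the per-sample weights.
    """
    # Distance between adjacent samples; the last interval is treated as open-ended.
    deltas = np.diff(t_vals, append=t_vals[-1] + 1e10)
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: probability that the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0), weights

def mse(pred, target):
    """Photometric error between rendered and observed colors."""
    return float(np.mean((pred - target) ** 2))

def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def total_loss(pred_rgb, gt_rgb, feat_rendered, feat_ref, lam=0.1):
    """Hypothetical combination: MSE plus a weighted (1 - cosine similarity)."""
    semantic = 1.0 - cosine_similarity(feat_rendered, feat_ref)
    return mse(pred_rgb, gt_rgb) + lam * semantic


# Toy ray: an opaque red sample in front of green and blue samples.
t_vals = np.array([1.0, 2.0, 3.0])
sigmas = np.array([1e3, 0.0, 0.0])
colors = np.eye(3)
rgb, weights = composite_color(sigmas, colors, t_vals)
```

In this toy ray the first sample is effectively opaque, so it receives nearly all of the weight and the rendered color is close to red. In the coarse-to-fine setting of the caption, the same losses would be evaluated for both the coarse and the fine model outputs.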