Empowering Sparse-Input Neural Radiance Fields with Dual-Level Semantic Guidance from Dense Novel Views

Yingji Zhong, Kaichen Zhou, Zhihao Li, Lanqing Hong, Zhenguo Li, Dan Xu

TL;DR

NeRFs trained on sparse inputs suffer from shape-radiance ambiguity. The authors propose S$^3$NeRF, a self-improved framework that uses semantics rendered from dense novel views as dual-level guidance. At the supervision level, Bi-Directional Verification (BDV) validates rendered semantic labels across views; at the feature level, a semantic-aware codebook is embedded in the MLP. The training objective combines a reconstruction loss $L_{\rm recon}$ and a semantic loss $L_{\rm sem}$ as $L = L_{\rm recon} + \lambda L_{\rm sem}$, with BDV producing a binary ray weight $w(\hat{\mathbf r}) \in \{0,1\}$. On the indoor datasets Replica and ScanNet++, with as few as 6 input views, the method improves consistently over state-of-the-art sparse-input methods, and ablations confirm that both guidance levels contribute. A challenging inside-out benchmark further demonstrates the robustness of the semantic-guided approach.
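The objective above can be illustrated with a minimal sketch. It assumes that $L_{\rm recon}$ is a mean squared error on colors and that $L_{\rm sem}$ is a cross-entropy over rendered semantic logits, masked by the binary BDV weight and averaged over the valid rays. Neither form is stated in this summary, and the names `cross_entropy`, `total_loss`, and the default `lam` are hypothetical.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of one ray's semantic logits against an integer label."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

def total_loss(pred_rgb, gt_rgb, sem_logits, pseudo_labels, weights, lam=0.1):
    """L = L_recon + lam * L_sem, with L_sem masked by binary ray weights.

    pred_rgb / gt_rgb: per-ray colors for rays from the sparse input views.
    sem_logits / pseudo_labels / weights: per-ray values for rays from the
    dense novel views; weights[i] is 0 (rejected) or 1 (accepted).
    """
    # Assumed form of L_recon: mean squared error over all color channels.
    sq = [(p - g) ** 2
          for pr, gr in zip(pred_rgb, gt_rgb)
          for p, g in zip(pr, gr)]
    l_recon = sum(sq) / len(sq)

    # Assumed form of L_sem: weighted cross-entropy averaged over valid rays.
    n_valid = sum(weights)
    l_sem = 0.0
    if n_valid > 0:
        l_sem = sum(w * cross_entropy(z, y)
                    for z, y, w in zip(sem_logits, pseudo_labels, weights)) / n_valid
    return l_recon + lam * l_sem
```

With this masking, a ray whose weight is 0 contributes nothing to the semantic term, so an incorrect rendered label on that ray cannot influence training.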

Abstract

Neural Radiance Fields (NeRF) have shown remarkable capabilities for photorealistic novel view synthesis. One major deficiency of NeRF is that dense inputs are typically required, and rendering quality drops drastically given sparse inputs. In this paper, we highlight the effectiveness of rendered semantics from dense novel views, and show that rendered semantics can be treated as a more robust form of augmented data than rendered RGB. Our method enhances NeRF's performance by incorporating guidance derived from the rendered semantics. The rendered semantic guidance encompasses two levels: the supervision level and the feature level. The supervision-level guidance incorporates a bi-directional verification module that decides the validity of each rendered semantic label, while the feature-level guidance integrates a learnable codebook that encodes semantic-aware information, which each point queries via the attention mechanism to obtain semantic-relevant predictions. The overall semantic guidance is embedded into a self-improved pipeline. We also introduce a more challenging sparse-input indoor benchmark, where the number of inputs is limited to as few as 6. Experiments demonstrate the effectiveness of our method, which outperforms existing approaches.
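The codebook query described in the abstract can be sketched as plain scaled dot-product attention. This is a simplification under stated assumptions: the point feature serves directly as the query and the codebook entries serve as both keys and values, with no learned projections. The paper's actual layer may differ, and the names `softmax` and `query_codebook` are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def query_codebook(feature, codebook):
    """Attend from one point feature over all codebook entries.

    feature:  list of d floats (the point's implicit feature, used as query).
    codebook: list of K entries, each a list of d floats (keys and values).
    Returns (attended feature of length d, attention weights of length K).
    """
    d = len(feature)
    # Similarity between the query and every codebook entry.
    scores = [sum(f * b for f, b in zip(feature, entry)) / math.sqrt(d)
              for entry in codebook]
    attn = softmax(scores)
    # Weighted sum of codebook entries.
    out = [sum(a * entry[j] for a, entry in zip(attn, codebook))
           for j in range(d)]
    return out, attn
```

In this sketch the attended feature is a convex combination of codebook entries, so points with similar features retrieve similar shared information.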

Paper Structure

This paper contains 21 sections, 8 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: Our method exploits rendered semantics from dense novel views to boost the performance of NeRF from sparse inputs. The rendered semantics are exploited through semantic guidance at two levels: the supervision level and the feature level. Our method is embedded into a self-improved framework.
  • Figure 2: Overview of the proposed S$^3$NeRF, which is built upon a self-improved pipeline. S$^3$NeRF exploits the rendered semantics from the teacher NeRF through two levels of guidance: supervision-level guidance with Bi-Directional Verification (BDV), and feature-level guidance with a semantic-aware codebook. BDV returns a validity map for each semantic map, indicating the correctness of semantic labels for robust supervision. The semantic-aware codebook encodes the correlation among densities, colors, and semantics, further exploiting the underlying information embedded in the semantic labels. The codebook is integrated into the MLP of the student NeRF.
  • Figure 3: Illustration of the proposed Bi-Directional Verification (BDV) module. For each rendered semantic map, BDV is applied between that map and each source view through projection and verification. Based on the projection results, the verification step produces a consensus-based verified map. A validity map is then created by merging the verified maps from all source views, reflecting the accuracy of the rendered semantic map. Depth maps are omitted.
  • Figure 4: Improved color predictions by the feature-level guidance, leveraging semantic-relevant information in the codebook.
  • Figure 5: (a) MLP of the semantic NeRF (Zhi et al., 2021), where $\mathbf c$, $\sigma$ and $\mathbf g$ refer to the color, density, and semantic logits of each input 3D point. $\mathbf f$ is the implicit feature used to predict the density. (b) Our proposed MLP for feature-level guidance incorporates a semantic-aware codebook to encode the correlation among densities, colors, and semantics. Each 3D point queries the codebook via the attention mechanism shown in (c). The codebook $\mathbf B$ is updated by the gradient from the reconstruction loss, while the semantic field is learned from the semantic loss.
  • ...and 9 more figures
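The projection, verification, and merging steps in the Figure 3 caption can be illustrated with a 1D toy sketch. Everything specific here is an assumption rather than the paper's procedure: the correspondence tables `forward` and `backward` stand in for depth-based projection between views, a pixel counts as verified when the round trip returns near its start and both views agree on the label, and verified maps are merged with a logical OR. The names `verify_against_source` and `merge_validity` are hypothetical.

```python
def verify_against_source(novel_labels, source_labels, forward, backward, tol=1):
    """Verify one rendered semantic map against one source view (1D toy).

    novel_labels / source_labels: per-pixel integer labels.
    forward[i]:  source pixel that novel pixel i projects to, or None if it
                 falls outside the source view.
    backward[j]: novel pixel that source pixel j projects to, or None.
    """
    verified = []
    for i, label in enumerate(novel_labels):
        j = forward[i]
        ok = False
        if j is not None and backward[j] is not None:
            # Round trip must land within `tol` pixels of the start ...
            round_trip_ok = abs(backward[j] - i) <= tol
            # ... and both views must agree on the semantic label.
            ok = round_trip_ok and source_labels[j] == label
        verified.append(ok)
    return verified

def merge_validity(verified_maps):
    """Merge per-source verified maps into one binary validity map.

    Assumed rule: a pixel is valid if at least one source view verifies it.
    """
    return [1 if any(column) else 0 for column in zip(*verified_maps)]
```

The resulting 0/1 entries play the role of the binary ray weight: only rays whose rendered label passes verification would contribute to semantic supervision.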