Empowering Sparse-Input Neural Radiance Fields with Dual-Level Semantic Guidance from Dense Novel Views
Yingji Zhong, Kaichen Zhou, Zhihao Li, Lanqing Hong, Zhenguo Li, Dan Xu
TL;DR
Sparse-input NeRFs suffer from shape-radiance ambiguity when data are scarce. The authors propose S^3NeRF, a self-improved framework that uses rendered semantics from dense novel views as dual-level guidance: at the supervision level, Bi-Directional Verification (BDV) validates rendered semantic labels across views; at the feature level, a semantic-aware codebook is embedded in the MLP. The training objective combines a reconstruction loss $L_{\rm recon}$ and a semantic loss $L_{\rm sem}$ as $L = L_{\rm recon} + \lambda L_{\rm sem}$, with BDV producing a binary ray weight $w(\hat{\mathbf r}) \in \{0,1\}$. Evaluations on the indoor datasets Replica and ScanNet++ with as few as 6 input views show consistent improvements over state-of-the-art sparse-input methods, and ablations confirm the benefit of both guidance levels; a challenging inside-out benchmark further demonstrates the robustness of the semantic-guided approach.
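The objective $L = L_{\rm recon} + \lambda L_{\rm sem}$ with a binary ray weight can be sketched as follows. This is a minimal illustration, not the authors' implementation: the summary does not give the per-ray form of either loss, so mean squared error for $L_{\rm recon}$, cross-entropy for $L_{\rm sem}$, applying $w(\hat{\mathbf r})$ as a mask on the semantic term, the default `lam=0.1`, and all function names are assumptions.

```python
import math


def recon_loss(pred_rgb, gt_rgb):
    """Mean squared error over rays and colour channels (assumed form of L_recon)."""
    total, count = 0.0, 0
    for pred, gt in zip(pred_rgb, gt_rgb):
        for p, g in zip(pred, gt):
            total += (p - g) ** 2
            count += 1
    return total / count


def masked_semantic_loss(probs, labels, weights):
    """Cross-entropy averaged over rays whose weight is 1 (assumed form of L_sem).

    probs   : per-ray class probabilities rendered by the model
    labels  : per-ray rendered semantic labels used as pseudo ground truth
    weights : per-ray binary weights w in {0, 1}, as produced by BDV
    """
    total, kept = 0.0, 0
    for p, y, w in zip(probs, labels, weights):
        if w == 1:  # rays rejected by verification contribute nothing
            total += -math.log(max(p[y], 1e-12))
            kept += 1
    return total / kept if kept else 0.0


def total_loss(pred_rgb, gt_rgb, probs, labels, weights, lam=0.1):
    """L = L_recon + lam * L_sem; lam=0.1 is an illustrative value only."""
    return recon_loss(pred_rgb, gt_rgb) + lam * masked_semantic_loss(probs, labels, weights)
```

With weights `[1, 0]`, only the first ray's label enters the semantic term, so an unreliable label on the second ray cannot pull the model toward it.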
Abstract
Neural Radiance Fields (NeRF) have shown remarkable capabilities for photorealistic novel view synthesis. One major deficiency of NeRF is that it typically requires dense inputs, and rendering quality drops drastically given sparse inputs. In this paper, we highlight the effectiveness of rendered semantics from dense novel views, and show that rendered semantics can be treated as a more robust form of augmented data than rendered RGB. Our method enhances NeRF's performance by incorporating guidance derived from the rendered semantics. This guidance operates at two levels: the supervision level and the feature level. The supervision-level guidance incorporates a bi-directional verification module that decides the validity of each rendered semantic label, while the feature-level guidance integrates a learnable codebook that encodes semantic-aware information, which each point queries via an attention mechanism to obtain semantic-relevant predictions. The overall semantic guidance is embedded into a self-improved pipeline. We also introduce a more challenging sparse-input indoor benchmark, where the number of inputs is limited to as few as 6. Experiments demonstrate the effectiveness of our method, which outperforms existing approaches.
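The feature-level idea of a point querying a learnable codebook through attention can be illustrated with a small sketch. The abstract does not specify the attention form, so scaled dot-product attention, the use of codebook entries as both keys and values, and the name `query_codebook` are all assumptions made for illustration; in the actual method the codebook would be learned, not fixed as here.

```python
import math


def query_codebook(point_feat, codebook):
    """Attend from one point feature to every codebook entry.

    Returns the attention weights and the attention-weighted mix of entries.
    Scaled dot-product scoring is an assumption, not taken from the paper.
    """
    dim = len(point_feat)
    scores = [
        sum(q * k for q, k in zip(point_feat, entry)) / math.sqrt(dim)
        for entry in codebook
    ]
    # Numerically stable softmax over the codebook entries.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    attn = [e / norm for e in exps]
    # Weighted sum of entries: the semantic-aware feature for this point.
    mixed = [
        sum(a * entry[j] for a, entry in zip(attn, codebook))
        for j in range(dim)
    ]
    return attn, mixed
```

A point feature that aligns with one entry receives most of its output from that entry, while an uninformative feature (all zeros) receives a uniform blend of all entries.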
