GSNeRF: Generalizable Semantic Neural Radiance Fields with Enhanced 3D Scene Understanding
Zi-Ting Chou, Sheng-Yu Huang, I-Jieh Liu, Yu-Chiang Frank Wang
TL;DR
This paper introduces GSNeRF, a Generalizable Semantic Neural Radiance Field that jointly performs novel-view synthesis and semantic segmentation for unseen scenes. It comprises two stages: Semantic Geo-Reasoning, which derives image, semantic, and 3D volume features and predicts a target-view depth map $D_T$, and Depth-Guided Visual Rendering, which uses $D_T$ to concentrate samples near the predicted depth for efficient volume rendering of the target image $I_T$ and semantic rendering of the target semantic map $S_T$. The authors report that depth-guided sampling and a dedicated semantic renderer improve semantic accuracy by about 5% over strong baselines while maintaining competitive image quality, validated on the ScanNet and Replica datasets both with and without ground-truth (GT) depth supervision. The approach advances practical 3D scene understanding by enabling on-the-fly, multi-view-conditioned rendering of both appearance and semantics in unseen environments, reducing the need for scene-specific retraining and extensive annotations.
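As a rough illustration of the depth-guided sampling idea summarized above, the following NumPy sketch places a small number of samples per ray in a window around a predicted depth and then computes standard NeRF-style compositing weights, which can be reused for both color and semantic outputs. The function names, the fixed window `half_width`, and the sample count are assumptions made for this example; the summary does not specify the exact sampling rule GSNeRF uses.

```python
import numpy as np

def depth_guided_samples(depth, near, far, n_samples=8, half_width=0.2):
    """Sample ray depths in a window around a predicted depth (hypothetical rule).

    depth: (R,) predicted target-view depth per ray (a stand-in for D_T).
    Returns z of shape (R, n_samples), clipped to [near, far].
    """
    depth = np.asarray(depth, dtype=np.float64)
    lo = np.clip(depth - half_width, near, far)
    hi = np.clip(depth + half_width, near, far)
    t = np.linspace(0.0, 1.0, n_samples)  # evenly spaced within the window
    return lo[:, None] + (hi - lo)[:, None] * t[None, :]

def render_weights(sigma, z):
    """Standard volume-rendering weights w_i = T_i * (1 - exp(-sigma_i * delta_i))."""
    deltas = np.diff(z, axis=-1)
    # The last interval is treated as effectively infinite, as is common in NeRF code.
    deltas = np.concatenate([deltas, np.full_like(z[:, :1], 1e10)], axis=-1)
    alpha = 1.0 - np.exp(-sigma * deltas)
    # Transmittance: probability that the ray reaches sample i unoccluded.
    trans = np.cumprod(
        np.concatenate([np.ones_like(alpha[:, :1]), 1.0 - alpha[:, :-1]], axis=-1),
        axis=-1,
    )
    return alpha * trans

def composite(weights, values):
    """Weighted sum along the ray; values may be RGB or per-class semantic logits."""
    return (weights[..., None] * values).sum(axis=1)
```

With per-sample densities, colors, and semantic logits coming from some network (not shown here), the same weights would yield a pixel color via `composite(w, rgb)` and a semantic prediction via `composite(w, logits)`. Because samples cluster around the predicted surface, far fewer of them are needed than with uniform sampling over the whole `[near, far]` range.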
Abstract
Utilizing multi-view inputs to synthesize novel-view images, Neural Radiance Fields (NeRF) have emerged as a popular research topic in 3D vision. In this work, we introduce a Generalizable Semantic Neural Radiance Field (GSNeRF), which uniquely takes image semantics into the synthesis process so that both novel-view images and the associated semantic maps can be produced for unseen scenes. Our GSNeRF is composed of two stages: Semantic Geo-Reasoning and Depth-Guided Visual Rendering. The former observes multi-view image inputs to extract semantic and geometry features from a scene. Guided by the resulting image geometry information, the latter performs both image and semantic rendering with improved performance. Our experiments not only confirm that GSNeRF performs favorably against prior works on both novel-view image and semantic segmentation synthesis, but also verify the effectiveness of our sampling strategy for visual rendering.
