GSNeRF: Generalizable Semantic Neural Radiance Fields with Enhanced 3D Scene Understanding

Zi-Ting Chou; Sheng-Yu Huang; I-Jieh Liu; Yu-Chiang Frank Wang

TL;DR

This paper introduces GSNeRF, a Generalizable Semantic Neural Radiance Field that jointly performs novel-view synthesis and semantic segmentation for unseen scenes. It comprises two stages. Semantic Geo-Reasoning derives image, semantic, and 3D volume features and predicts a target-view depth map $D_T$. Depth-Guided Visual Rendering then uses $D_T$ to perform depth-focused sampling for efficient volume rendering of the target image $I_T$ and semantic rendering of the segmentation map $S_T$. The authors report that depth-guided sampling and a dedicated semantic renderer improve semantic accuracy by about 5% over strong baselines while maintaining competitive image quality, validated on the ScanNet and Replica datasets with and without ground-truth (GT) depth supervision. The approach advances practical 3D scene understanding by rendering both appearance and semantics on the fly from multi-view inputs in unseen environments, reducing the need for scene-specific retraining and extensive annotations.
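To make the idea of depth-focused sampling concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the predicted depth can be treated as a distance along each ray and places a few samples in a narrow band around it. The function names `depth_guided_samples` and `sample_points`, the uniform spacing, and the parameters `half_width`, `near`, and `far` are all illustrative assumptions that do not come from the paper.

```python
import numpy as np


def depth_guided_samples(depth, num_samples=8, half_width=0.1, near=0.1, far=10.0):
    """Place samples in a narrow band around a predicted per-ray depth.

    depth: array of shape (R,), one predicted depth value per ray.
    Returns an array of shape (R, num_samples) of distances along each ray.
    The band width and uniform spacing are assumptions made for illustration.
    """
    depth = np.asarray(depth, dtype=np.float64)
    # Symmetric offsets around the predicted surface location.
    offsets = np.linspace(-half_width, half_width, num_samples)
    t_vals = depth[:, None] + offsets[None, :]
    # Keep every sample inside the valid near/far range of the scene.
    return np.clip(t_vals, near, far)


def sample_points(origins, directions, t_vals):
    """Convert per-ray distances into 3D points.

    origins, directions: arrays of shape (R, 3); t_vals: shape (R, S).
    Returns an array of shape (R, S, 3).
    """
    return origins[:, None, :] + t_vals[..., None] * directions[:, None, :]


if __name__ == "__main__":
    predicted_depth = np.array([1.0, 2.5])
    t = depth_guided_samples(predicted_depth, num_samples=5, half_width=0.2)
    rays_o = np.zeros((2, 3))
    rays_d = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
    print(sample_points(rays_o, rays_d, t).shape)
```

Compared with spreading samples uniformly between the near and far planes, concentrating them near the estimated surface lets a small sample budget cover the region that contributes most to the rendered pixel, which is the intuition behind the efficiency claim above.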

Abstract

Utilizing multi-view inputs to synthesize novel-view images, Neural Radiance Fields (NeRF) have emerged as a popular research topic in 3D vision. In this work, we introduce a Generalizable Semantic Neural Radiance Field (GSNeRF), which uniquely incorporates image semantics into the synthesis process so that both novel-view images and the associated semantic maps can be produced for unseen scenes. Our GSNeRF is composed of two stages: Semantic Geo-Reasoning and Depth-Guided Visual Rendering. The former observes multi-view image inputs to extract semantic and geometry features from a scene. Guided by the resulting image geometry information, the latter performs both image and semantic rendering with improved performance. Our experiments confirm that GSNeRF performs favorably against prior works on both novel-view image and semantic segmentation synthesis, and further verify the effectiveness of our sampling strategy for visual rendering.

Paper Structure (39 sections, 18 equations, 7 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Neural Radiance Fields
Generalizable Novel View Synthesis
Multi-tasking NeRF
Brief Review of Generalizable NeRFs
Method
Problem Formulation and Model Overview
Generalizable Semantic NeRF
Semantic Geo-Reasoning
Depth-Guided Visual Rendering
Volume Rendering
Semantic Rendering
Training and Inference
Training
...and 24 more sections

Figures (7)

Figure A1: Overview of GSNeRF, including Semantic Geo-Reasoning and Depth-Guided Visual Rendering. Given $K$ multi-view images $I_{1:K}$ of a scene, the Semantic Geo-Reasoner predicts a depth map for each source image ($D_{1:K}$); these are aggregated to estimate the target-view depth map $D_T$. With $D_T$ as key geometric guidance, Depth-Guided Visual Rendering renders the target-view image $I_T$ and semantic segmentation $S_T$ using the Volume Renderer $R_\theta$ and the Semantic Renderer $P_\theta$, respectively.
Figure A2: Semantic Geo-Reasoning. The Semantic Geo-Reasoner contains a shared Encoder $E_{\theta}$, a shared Decoder $D_{\theta}$, and a cost-volume aggregator $C_{\theta}$, producing image and volume features ($f^I_{1:K}$ and $f^V_{1:K}$) with the associated depth maps $D_{1:K}$. The depth map $D_T$ of the target view is estimated from $D_{1:K}$.
Figure A3: Qualitative evaluation. We compare the visual quality of the rendered novel-view images (the first three columns) and semantic segmentation maps (the last three columns) with S-Ray (Liu et al., 2023).
Figure A4: Sampling efficiency on ScanNet. Compared to state-of-the-art methods, GSNeRF achieves significantly better rendering performance, especially when the number of sampling points is small. Even as the number of sampling points increases, GSNeRF still performs favorably against existing models.
Figure A5: Qualitative results of finetuning on ScanNet. Unlike GSNeRF, S-Ray fails to capture the semantic contour of the door (in red) at the upper-left corner.
...and 2 more figures
