GSN: Generalisable Segmentation in Neural Radiance Field

Vinayak Gupta, Rahul Goel, Sirikonda Dhawal, P. J. Narayanan

TL;DR

The paper addresses the limitation of traditional and many generalised radiance-field methods that either overfit to a scene or struggle to provide consistent semantic labels across unseen scenes. It introduces GSN, a generalised NeRF Transformer that distils multiple semantic feature fields into a single generalisable representation, enabling on-the-fly novel-view rendering with per-pixel semantics. A two-stage training paradigm combines RGB view synthesis across scenes (Stage I) with feature distillation to a student head guided by a teacher (Stage II), using semantic cues such as DINO to drive segmentation. The approach achieves segmentation performance on par with scene-specific methods on LLFF data, demonstrates multi-view consistency, and supports integrating diverse semantic fields, offering a practical path toward scalable, semantic-rich generalisable radiance fields for downstream tasks.
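
As a rough illustration of the Stage II idea, the sketch below scores a student's predicted per-pixel features against teacher features (for example, DINO features of the same pixels). The squared-error form, the function name `distillation_loss`, and the plain-list data layout are assumptions made for this example; the summary above does not state the loss the authors use.

```python
from typing import Sequence

Feature = Sequence[float]


def distillation_loss(student: Sequence[Feature], teacher: Sequence[Feature]) -> float:
    """Mean squared error between student and teacher per-pixel features.

    Assumption: a plain L2 objective. The real training loss may differ.
    """
    if len(student) != len(teacher):
        raise ValueError("student and teacher must cover the same pixels")

    total = 0.0
    count = 0
    for s_feat, t_feat in zip(student, teacher):
        if len(s_feat) != len(t_feat):
            raise ValueError("feature dimensions must match")
        for s_val, t_val in zip(s_feat, t_feat):
            total += (s_val - t_val) ** 2
            count += 1

    if count == 0:
        raise ValueError("no feature values to compare")
    return total / count


# Hypothetical usage: two pixels with 2-dimensional features.
student_features = [[0.0, 0.0], [1.0, 1.0]]
teacher_features = [[1.0, 3.0], [1.0, 1.0]]
loss = distillation_loss(student_features, teacher_features)
```

In a real pipeline the same comparison would run on batched tensors inside an autodiff framework, so that gradients flow back into the student head.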

Abstract

Traditional Radiance Field (RF) representations capture details of a specific scene and must be trained afresh on each scene. Semantic feature fields have been added to RFs to facilitate several segmentation tasks. Generalised RF representations learn the principles of view interpolation. A generalised RF can render new views of an unknown and untrained scene, given a few views. We present a way to distil feature fields into the generalised GNT representation. Our GSN representation generates new views of unseen scenes on the fly along with consistent, per-pixel semantic features. This enables multi-view segmentation of arbitrary new scenes. We show different semantic features being distilled into generalised RFs. Our multi-view segmentation results are on par with methods that use traditional RFs. GSN closes the gap between standard and generalisable RF methods significantly. Project Page: https://vinayak-vg.github.io/GSN/
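
One simple way to turn per-pixel semantic features into a segmentation is to compare every pixel's feature with a query feature taken from a user stroke and keep the pixels that are similar enough. The following is a minimal sketch under that assumption; `segment_from_stroke`, the cosine-similarity measure, and the threshold value are illustrative choices, not the authors' procedure.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def segment_from_stroke(
    features: List[List[List[float]]],  # features[row][col] is one pixel's feature
    stroke: List[Tuple[int, int]],      # (row, col) pixels marked by the user
    threshold: float,
) -> List[List[bool]]:
    """Mark pixels whose feature is close to the mean stroke feature."""
    if not stroke:
        raise ValueError("stroke must contain at least one pixel")

    # Average the features under the stroke to form a single query vector.
    dim = len(features[0][0])
    query = [0.0] * dim
    for row, col in stroke:
        for k in range(dim):
            query[k] += features[row][col][k]
    query = [value / len(stroke) for value in query]

    # Threshold the similarity of every pixel against the query.
    return [
        [cosine_similarity(pixel, query) >= threshold for pixel in row]
        for row in features
    ]


# Hypothetical 2x2 feature image: the top row resembles the stroked pixel.
feature_image = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
mask = segment_from_stroke(feature_image, stroke=[(0, 0)], threshold=0.9)
```

Because the features are rendered per view, the same query vector can be reused on other rendered views, which is what makes consistent features useful for multi-view segmentation.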

Paper Structure

This paper contains 25 sections, 2 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: Overview: Stage I: (1) We aggregate the features from the source views in the view transformer, constrained by the epipolar geometry. (2) The point-aggregated features are passed to the ray transformer, along with the input positions, to aggregate information along the ray. (3) The ray-aggregated features and the input view direction are passed to an MLP and pooled to obtain the pixel-wise colour. Stage II: (4) The view-independent features from the ray transformer are passed to the Stage II block and aggregated by the view transformer and the ray transformer, using source-view features extracted from the images with a pre-trained model such as DINO. (5) The features output by the ray transformer are concatenated with the input positions and pooled to predict the pixel-wise features of the corresponding target-view pixel.
  • Figure 2: Comparison: Row 1 shows the reference scenes. Row 2 shows the segmentation results of N3F/DFF with the corresponding patch query. Row 3 shows the segmentation results of ISRF with strokes. Row 4 shows the segmentation results of our GSN method. Note that the previous methods rely on scene-specific training to enable segmentation. For more details (highlighted boxes), please refer to the Results section in the manuscript.
  • Figure 3: Row 1 shows DINO features computed in various settings on Horns from LLFF. For visual simplification, the features have been reduced to 3 dimensions with PCA. Col. 1 and Col. 2 show DINO features computed on the original image and on GNT's output respectively. Col. 3 shows features predicted by our GSN method. The boxes highlight clear feature differences. We demonstrate better feature quality by running K-Means clustering on the feature images, as shown in Row 2. Our method gives clear, noise-free clusters, as shown by the boxes.
  • Figure 4: Other Semantic Fields: Col. 1 shows the Flower scene with DINOv2 features distilled into it. We show part segmentation of the flowers, i.e., the parts of each flower are coloured the same, depicting the distillation of appropriate features. Col. 2 shows the result on the Fortress scene when SAM features are distilled into our GSN model, and the SAM decoder is used to segment the image. Col. 3 shows the distillation of CLIP features into the T-Rex scene. We use the text prompt "a fossil of dinosaur" to localise the object in the rendered image. The heat map shows how well each pixel corresponds to the text prompt. This figure shows that our generalised GSN model can incorporate various semantic features.
  • Figure 5: Left and right images show segmentation results with original GNT and our GSN architectures respectively. We input the same single stroke and threshold for segmentation. In our case, the features are more coherent, leading to more accurate segmentation.
  • ...and 7 more figures
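
The clustering check described in the Figure 3 caption can be sketched as plain K-Means over the per-pixel features of a feature image. This is a dependency-free illustration, assuming a row-major grid of feature vectors and a deterministic, evenly spaced initialisation; the helper names `kmeans` and `cluster_feature_image` are hypothetical, and the paper's actual clustering setup (number of clusters, initialisation, library) is not given here.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[List[float]], k: int, iterations: int = 20) -> List[int]:
    """Cluster points into k groups and return one label per point."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    # Assumption: deterministic initialisation from evenly spaced points.
    step = len(points) // k
    centroids = [list(points[i * step]) for i in range(k)]

    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_feature_image(features: List[List[List[float]]], k: int) -> List[List[int]]:
    """Run K-Means on an H x W grid of features and return an H x W label map."""
    width = len(features[0])
    flat = [pixel for row in features for pixel in row]
    labels = kmeans(flat, k)
    return [labels[r * width:(r + 1) * width] for r in range(len(features))]


# Hypothetical 2x2 feature image with two clearly separated groups.
feature_image = [
    [[0.0, 0.0], [0.1, 0.0]],
    [[5.0, 5.0], [5.1, 5.0]],
]
label_map = cluster_feature_image(feature_image, k=2)
```

Noisy features tend to produce speckled label maps, while coherent features produce contiguous regions, which is the qualitative difference the caption points to.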