Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields

Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Zehao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, Achuta Kadambi

TL;DR

In addition to radiance field rendering, this work enables 3D Gaussian splatting on arbitrary-dimension semantic features via 2D foundation model distillation, and is the first method to enable point and bounding-box prompting for radiance field manipulation, by leveraging the SAM model.

Abstract

3D scene representations have gained immense popularity in recent years. Methods based on Neural Radiance Fields (NeRF) are versatile for traditional tasks such as novel view synthesis. More recently, work has emerged that extends the functionality of NeRF beyond view synthesis to semantically aware tasks such as editing and segmentation, using 3D feature field distillation from 2D foundation models. However, these methods have two major limitations: (a) they are limited by the rendering speed of NeRF pipelines, and (b) implicitly represented feature fields suffer from continuity artifacts that reduce feature quality. Recently, 3D Gaussian Splatting (3DGS) has shown state-of-the-art performance on real-time radiance field rendering. In this work, we go one step further: in addition to radiance field rendering, we enable 3D Gaussian splatting on arbitrary-dimension semantic features via 2D foundation model distillation. This translation is not straightforward: naively incorporating feature fields in the 3DGS framework encounters significant challenges, notably the disparities in spatial resolution and channel consistency between RGB images and feature maps. We propose architectural and training changes to efficiently avert this problem. Our proposed method is general, and our experiments showcase novel view semantic segmentation, language-guided editing, and segment anything through learning feature fields from state-of-the-art 2D foundation models such as SAM and CLIP-LSeg. Across experiments, our distillation method provides comparable or better results, while being significantly faster to both train and render. Additionally, to the best of our knowledge, ours is the first method to enable point and bounding-box prompting for radiance field manipulation, by leveraging the SAM model. Project website at: https://feature-3dgs.github.io/
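To make the idea of splatting arbitrary-dimension features concrete, the following is a minimal single-pixel sketch, not the authors' CUDA rasterizer. It assumes that each Gaussian carries a feature vector next to its color, that both are blended with the same front-to-back alpha-compositing weights, and that the rendered feature is compared with the 2D foundation model's feature through an L1 loss. The function names `composite` and `l1_loss` and all numeric values are hypothetical.

```python
from typing import List, Sequence


def composite(values: Sequence[Sequence[float]], alphas: Sequence[float]) -> List[float]:
    """Front-to-back alpha compositing of per-Gaussian vectors for one pixel.

    values[i] is the vector (RGB or an N-dimensional feature) carried by the
    i-th depth-sorted Gaussian; alphas[i] is its opacity at this pixel.
    """
    out = [0.0] * len(values[0])
    transmittance = 1.0  # fraction of the ray not yet absorbed
    for vec, alpha in zip(values, alphas):
        weight = alpha * transmittance
        for c, v in enumerate(vec):
            out[c] += weight * v
        transmittance *= 1.0 - alpha
    return out


def l1_loss(rendered: Sequence[float], teacher: Sequence[float]) -> float:
    """Mean absolute difference between rendered and teacher features."""
    return sum(abs(r - t) for r, t in zip(rendered, teacher)) / len(rendered)


# Two depth-sorted Gaussians covering the same pixel (made-up values).
alphas = [0.5, 0.5]
colors = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
features = [[1.0, 0.0, 2.0, 0.0], [0.0, 4.0, 0.0, 8.0]]  # N = 4 here

pixel_rgb = composite(colors, alphas)        # same weights for color ...
pixel_feature = composite(features, alphas)  # ... and for features

# Hypothetical teacher feature at this pixel from a 2D foundation model.
teacher_feature = [0.5, 1.0, 1.0, 1.0]
feature_loss = l1_loss(pixel_feature, teacher_feature)
```

Because color and feature share the compositing weights, the feature dimension only changes the length of the vector being accumulated, which is why one routine can serve both outputs in this sketch.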

Paper Structure (26 sections, 7 equations, 10 figures, 5 tables, 1 algorithm)

This paper contains 26 sections, 7 equations, 10 figures, 5 tables, 1 algorithm.

Figures (10)

  • Figure 1: An overview of our method. We adopt the same 3D Gaussian initialization from sparse SfM point clouds as utilized in 3DGS, with the addition of an essential attribute: the semantic feature. Our primary innovation lies in the development of a Parallel N-dimensional Gaussian Rasterizer, complemented by a convolutional speed-up module as an optional branch. This configuration is adept at rapidly rendering arbitrarily high-dimensional features without sacrificing downstream performance.
  • Figure 2: Novel view semantic segmentation (LSeg) results on scenes from the Replica dataset [straub2019replica] and the LLFF dataset [mildenhall2019local]. (a) We show examples of original images in training views together with the ground-truth feature visualizations. (b) We compare the qualitative segmentation results of our Feature 3DGS with those of NeRF-DFF [kobayashi2022decomposing]. Our inference is $1.66\times$ faster when the rendered feature dimension is 128. Our method demonstrates more fine-grained segmentation results with higher-quality feature maps.
  • Figure 3: Comparison of SAM segmentation results obtained by (a) naively applying the SAM encoder-decoder module to a novel-view rendered image and (b) directly decoding a rendered feature. Our method is up to $1.7\times$ faster in total inference speed, including rendering and segmentation, while preserving the quality of segmentation masks. Scene from [hedman2018deep].
  • Figure 4: Novel view segmentation (SAM) results compared with NeRF-DFF. (Upper) The NeRF-DFF method produces lower-quality segmentation masks: note the failure to segment the cup from the bear and the coarse-grained mask boundary on the bear's leg in the box-prompted results. (Lower) Our method provides higher-quality masks with more fine-grained segmentation details. Scene from [kerr2023lerf].
  • Figure 5: Results of various language-guided edit operations performed by querying the 3D feature field, and comparison with NeRF-DFF. (a) We compare our edit results with those of the NeRF-DFF method on the sample dataset provided by NeRF-DFF [kobayashi2022decomposing]. Note that our method outperforms NeRF-DFF by extracting the entire banana, which is partly hidden by an apple in the original image, with fewer floaters in the background. (b) We demonstrate deletion and appearance modification on different targets. Note that the car is deleted with the background preserved, and the appearance of the leaves changes while the appearance of the stop sign remains the same.
  • ...and 5 more figures
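The optional convolutional speed-up module mentioned in the Figure 1 caption can be illustrated with a small sketch. It assumes that the rasterizer renders a low-dimensional feature map and that a 1x1 convolution, which is the same linear map applied at every pixel, lifts it to the channel count of the 2D foundation model's features. The caption only says the module is convolutional, so the 1x1 kernel, the function name `conv1x1`, and all weights and shapes below are assumptions made for illustration.

```python
from typing import List

Matrix = List[List[float]]
FeatureMap = List[List[List[float]]]  # indexed as [row][column][channel]


def conv1x1(feature_map: FeatureMap, weight: Matrix, bias: List[float]) -> FeatureMap:
    """Apply a 1x1 convolution: one shared linear map per pixel.

    weight has shape [out_channels][in_channels]; bias has out_channels entries.
    """
    return [
        [
            [sum(w * x for w, x in zip(row_w, pixel)) + b
             for row_w, b in zip(weight, bias)]
            for pixel in row
        ]
        for row in feature_map
    ]


# A 1 x 2 rendered feature map with 2 channels (made-up values).
rendered = [[[1.0, 2.0], [0.0, 1.0]]]

# Hypothetical weights lifting 2 rendered channels to 4 teacher channels.
# In practice these would be learned jointly with the Gaussians.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
bias = [0.0, 0.0, 0.0, 0.5]

lifted = conv1x1(rendered, weight, bias)
```

Under this assumption the rasterizer only has to blend the small channel count per Gaussian, while the comparatively cheap per-pixel map restores the full feature dimension afterwards.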