FeatureNeRF: Learning Generalizable NeRFs by Distilling Foundation Models

Jianglong Ye, Naiyan Wang, Xiaolong Wang

TL;DR

FeatureNeRF addresses the limitation that generalizable NeRFs primarily target novel-view synthesis by enabling 3D semantic understanding through distillation of 2D vision foundation models into a NeRF. By predicting a 3D semantic feature volume alongside density and color, and by aligning NeRF-rendered features with teacher features from models like DINO and Latent Diffusion, it yields a 3D representation learned from 2D observations. The approach supports 2D/3D semantic keypoint transfer and object-part segmentation in a zero-shot or few-shot setting without 3D supervision, and maintains competitive novel-view synthesis performance. This framework has practical impact for flexible 3D understanding in real-world, cross-instance scenarios and paves the way for 3D editing and other downstream tasks using 2D foundation-model knowledge.

Abstract

Recent work on generalizable NeRFs has shown promising results on novel view synthesis from a single image or a few images. However, such models have rarely been applied to downstream tasks beyond synthesis, such as semantic understanding and parsing. In this paper, we propose a novel framework named FeatureNeRF to learn generalizable NeRFs by distilling pre-trained vision foundation models (e.g., DINO, Latent Diffusion). FeatureNeRF lifts 2D pre-trained foundation models to 3D space via neural rendering, and then extracts deep features for 3D query points from NeRF MLPs. Consequently, it can map 2D images to continuous 3D semantic feature volumes, which can be used for various downstream tasks. We evaluate FeatureNeRF on 2D/3D semantic keypoint transfer and 2D/3D object part segmentation. Our extensive experiments demonstrate the effectiveness of FeatureNeRF as a generalizable 3D semantic feature extractor. Our project page is available at https://jianglongye.com/featurenerf/ .
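
The distillation idea in the abstract can be illustrated with a minimal NumPy sketch. It assumes the standard NeRF volume-rendering weights are reused to composite per-point feature vectors into one pixel feature, and it assumes a squared L2 distance to the teacher feature; the exact loss form, the function names (`render_weights`, `render_feature`, `distill_loss`), and the random toy inputs are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def render_weights(sigma, deltas):
    """Standard NeRF volume-rendering weights for one ray.

    sigma:  (N,) densities at N samples along the ray
    deltas: (N,) distances between adjacent samples
    """
    alpha = 1.0 - np.exp(-sigma * deltas)
    # Transmittance: fraction of the ray that reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    return trans * alpha


def render_feature(sigma, deltas, point_features):
    """Composite per-point features (N, C) into a single pixel feature (C,)."""
    return render_weights(sigma, deltas) @ point_features


def distill_loss(rendered, teacher):
    """Squared L2 distance to the teacher feature (assumed loss form)."""
    return float(np.sum((rendered - teacher) ** 2))


# Toy data standing in for NeRF MLP outputs and a 2D teacher feature (hypothetical).
rng = np.random.default_rng(0)
sigma = rng.uniform(0.0, 5.0, size=64)
deltas = np.full(64, 0.05)
point_features = rng.normal(size=(64, 8))
teacher_feature = rng.normal(size=8)

pixel_feature = render_feature(sigma, deltas, point_features)
loss = distill_loss(pixel_feature, teacher_feature)
```

In a real training loop the same weights would also composite color, and the loss would be summed over sampled rays and backpropagated through the NeRF MLPs; here the arrays are plain NumPy, so nothing is optimized.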

Paper Structure (16 sections, 9 equations, 15 figures, 4 tables)

Figures (15)

  • Figure 1: While most generalizable NeRFs focus on novel-view synthesis, we propose a framework named FeatureNeRF to learn 3D semantic representations by distilling vision foundation models. After distillation, FeatureNeRF can render novel-view feature maps from a single input image (a), which can be leveraged for various downstream tasks. Here, we show how we propagate part segmentation labels and keypoints to different views and instances in both 2D and 3D domains (b).
  • Figure 2: Pipeline of FeatureNeRF. Given a single image $I$ as input, FeatureNeRF uses an encoder to extract the image feature $E_{\pi(\mathbf{x})}$, then concatenates it with the query point $\mathbf{x}$ and the view direction $\mathbf{d}$ as the input to the NeRF MLPs. Apart from density $\sigma$ and color $\mathbf{c}$, we add two MLP branches to predict the feature vector $\mathbf{v}$ and the coordinate $\hat{\mathbf{x}}$, which are supervised by two novel loss terms, $\mathcal{L}_{\mathrm{distill}}$ and $\mathcal{L}_{\mathrm{coord}}$, respectively. In this way, we distill knowledge from 2D vision foundation models into FeatureNeRF. In addition, we propose extracting the internal NeRF feature $\mathbf{v}_{\mathrm{NeRF}}$ as a 3D-consistent feature representation.
  • Figure 3: Correspondence accuracy for cross-instance semantic keypoint transfer. The first row shows 2D keypoint transfer and the second row shows 3D keypoint transfer. Our approach, distilled with different teacher features, consistently outperforms the baselines for all categories in both 2D and 3D domains.
  • Figure 4: Qualitative results for cross-instance semantic keypoint transfer. Both 2D (a) and 3D (b) results are presented. Each row contains a source image with keypoint annotations and its pairwise transfer results.
  • Figure 5: Qualitative results for cross-instance part segmentation label transfer. Each row contains a source image and its 2D/3D transfer results. After distillation, FeatureNeRF learns richer semantic information, produces better boundaries, and preserves details such as small parts. Note that the segmentation label for the source instance is omitted.
  • ...and 10 more figures
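
The keypoint and label transfer shown in Figures 1, 3, 4, and 5 can be sketched as a nearest-neighbor search in feature space. This is a minimal sketch that assumes cosine similarity as the matching score; the function name `transfer_keypoints`, the array shapes, and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np


def transfer_keypoints(src_feats, tgt_feats, tgt_positions):
    """Match each source keypoint to the most similar target location.

    src_feats:     (K, C) features at K annotated source keypoints
    tgt_feats:     (M, C) features at M candidate target locations
    tgt_positions: (M, D) pixel coordinates (D=2) or 3D points (D=3)
    Returns the matched positions (K, D) and the matched indices (K,).
    """
    src = src_feats / np.linalg.norm(src_feats, axis=1, keepdims=True)
    tgt = tgt_feats / np.linalg.norm(tgt_feats, axis=1, keepdims=True)
    similarity = src @ tgt.T  # (K, M) cosine similarity (assumed metric)
    best = np.argmax(similarity, axis=1)
    return tgt_positions[best], best


# Toy target instance: 100 candidate 3D points with 16-dim features (hypothetical).
rng = np.random.default_rng(0)
tgt_feats = rng.normal(size=(100, 16))
tgt_positions = rng.uniform(-1.0, 1.0, size=(100, 3))

# Toy source keypoints: rescaled, slightly noisy copies of three target features.
true_idx = np.array([3, 42, 77])
src_feats = 2.0 * tgt_feats[true_idx] + 0.01 * rng.normal(size=(3, 16))

matched_positions, matched_idx = transfer_keypoints(src_feats, tgt_feats, tgt_positions)
```

Under the same assumption, part segmentation labels could be propagated with the same index lookup: compute the best match in the opposite direction (each target location against the labeled source locations) and copy the label of the matched source location.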