Structurally Disentangled Feature Fields Distillation for 3D Understanding and Editing

Yoel Levy, David Shavin, Itai Lang, Sagie Benaim

TL;DR

The paper tackles the limitation of using a single view-independent 3D feature field when distilling 2D features into 3D representations. It introduces structurally disentangled feature fields comprising a view-independent component and a view-dependent (reflective) component learned from 2D supervision, enabling independent segmentation and editing of objects and their reflections. The method combines a NeRF-like backbone with explicit decomposition of color and features, and uses a two-stage training regime guided by pretrained 2D features (e.g., DINOv2) to achieve improved 3D segmentation, reflection segmentation, removal, and editing (color and roughness) while preserving realistic reflections. This approach yields state-of-the-art performance on 3D segmentation and enables novel applications like reflective-region removal and component-wise editing, with practical implications for 3D understanding and content creation from 2D supervision. The work opens avenues for more physically grounded 3D scene manipulation and suggests future exploration of additional physical factors and 2D feature alignment.

Abstract

Recent work has demonstrated the ability to leverage or distill pre-trained 2D features obtained using large pre-trained 2D models into 3D features, enabling impressive 3D editing and understanding capabilities using only 2D supervision. Although impressive, existing models assume that 3D features are captured using a single feature field and often make the simplifying assumption that features are view-independent. In this work, we propose instead to capture 3D features using multiple disentangled feature fields that capture different structural components of 3D features involving view-dependent and view-independent components, which can be learned from 2D feature supervision only. Subsequently, each element can be controlled in isolation, enabling semantic and structural understanding and editing capabilities. For instance, using a user click, one can segment 3D features corresponding to a given object and then segment, edit, or remove its view-dependent (reflective) properties. We evaluate our approach on the task of 3D segmentation and demonstrate a set of novel understanding and editing tasks.
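
The click-based segmentation mentioned in the abstract can be illustrated with a minimal sketch. It assumes that segmentation is done by comparing the feature at the clicked location against all other features with cosine similarity and a fixed threshold. The abstract does not state the similarity measure or the threshold, so `cosine`, `segment_by_click`, and `threshold=0.9` are hypothetical names and values chosen only for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def segment_by_click(
    features: Sequence[Sequence[float]],
    click_index: int,
    threshold: float = 0.9,  # assumed value, not taken from the paper
) -> List[bool]:
    """Mark every feature whose similarity to the clicked feature reaches the threshold."""
    query = features[click_index]
    return [cosine(f, query) >= threshold for f in features]


# Toy features: the first two point in nearly the same direction, the third does not.
toy_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
mask = segment_by_click(toy_features, click_index=0)
```

Under the same assumption, running this selection on the view-independent features or on the view-dependent features would select an object or only its reflective component, respectively.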

Paper Structure

This paper contains 28 sections, 14 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: The method's pipeline. We decompose the appearance color of the scene $\mathbf{c}$ into physical components $\mathbf{c}_{indep}$ and $\mathbf{c}_{ref}$ and sum them to compute the color of the scene at location $\mathbf{x}$ and viewing direction $\mathbf{d}$. We also learn a decomposed feature field for the scene, $\mathbf{f}_{indep}$ and $\mathbf{f}_{ref}$, which enables physically-oriented semantic understanding and editing applications. Please see \ref{sec:method} for more details.
  • Figure 2: PCA of DINOv2 features for ground-truth input views of the Sedan scene from the real-world dataset of \cite{verbin2022refnerf}. We zoom in on the windshield, illustrating differences in corresponding locations between views.
  • Figure 3: 3D object segmentation from three novel views, for the Sedan scene from the real-world RefNeRF \cite{verbin2022refnerf} dataset for the objects Bonnet-top, Windshield, Hubcaps and Wheels, and for the Car scene from the synthetic Shiny Blender \cite{verbin2022refnerf} dataset for the objects Windshield and Wheels. We compare our result to DFF \cite{kobayashi2022decomposing} and to a baseline where DFF is optimized for features while RefNeRF is optimized for appearance (see \ref{sec:semantic_segmentation}).
  • Figure 4: Segmentation of the spheres for novel views of the Garden-spheres real-world scene using either the full segmentation of the sphere (second row) or only the reflective component of the spheres (third row).
  • Figure 5: Segmentation of the reflective component of different semantic components of the real-world car scene. The first row displays three novel views. We then demonstrate the segmentation of the reflective component of (1) both the bonnet-top and the windshield (second row), (2) the bonnet-top (third row), and (3) the windshield (fourth row).
  • ...and 4 more figures
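
The color decomposition described in the Figure 1 caption, where $\mathbf{c}_{indep}$ and $\mathbf{c}_{ref}$ are summed to give the color at location $\mathbf{x}$ and viewing direction $\mathbf{d}$, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the toy fields `toy_c_indep` and `toy_c_ref` and the helper `no_reflection` are hypothetical stand-ins for learned networks, and the assumption that $\mathbf{c}_{indep}$ depends only on $\mathbf{x}$ follows from its name rather than from an equation shown here.

```python
from typing import Callable, Tuple

Vec3 = Tuple[float, float, float]


def compose_color(
    c_indep: Callable[[Vec3], Vec3],
    c_ref: Callable[[Vec3, Vec3], Vec3],
    x: Vec3,
    d: Vec3,
) -> Vec3:
    """Sum the view-independent and view-dependent colors at point x seen from direction d."""
    base = c_indep(x)       # depends on position only (assumption)
    reflection = c_ref(x, d)  # depends on position and viewing direction
    return tuple(b + r for b, r in zip(base, reflection))


def toy_c_indep(x: Vec3) -> Vec3:
    """Hypothetical view-independent color: a constant dark red."""
    return (0.2, 0.1, 0.1)


def toy_c_ref(x: Vec3, d: Vec3) -> Vec3:
    """Hypothetical reflective color: a gray highlight that grows with d's z component."""
    strength = 0.5 * max(0.0, d[2])
    return (strength, strength, strength)


def no_reflection(x: Vec3, d: Vec3) -> Vec3:
    """Zero reflective color, standing in for removal of the reflective component."""
    return (0.0, 0.0, 0.0)


origin = (0.0, 0.0, 0.0)
facing = (0.0, 0.0, 1.0)
with_reflection = compose_color(toy_c_indep, toy_c_ref, origin, facing)
without_reflection = compose_color(toy_c_indep, no_reflection, origin, facing)
```

Because the two terms are kept separate until the final sum, zeroing or modifying one of them leaves the other untouched, which is the property that component-wise edits such as reflection removal rely on.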