Structurally Disentangled Feature Fields Distillation for 3D Understanding and Editing
Yoel Levy, David Shavin, Itai Lang, Sagie Benaim
TL;DR
The paper tackles the limitation of using a single view-independent 3D feature field when distilling 2D features into 3D representations. It introduces structurally disentangled feature fields comprising a view-independent component and a view-dependent (reflective) component learned from 2D supervision, enabling independent segmentation and editing of objects and their reflections. The method combines a NeRF-like backbone with explicit decomposition of color and features, and uses a two-stage training regime guided by pretrained 2D features (e.g., DINOv2) to achieve improved 3D segmentation, reflection segmentation, removal, and editing (color and roughness) while preserving realistic reflections. This approach yields state-of-the-art performance on 3D segmentation and enables novel applications like reflective-region removal and component-wise editing, with practical implications for 3D understanding and content creation from 2D supervision. The work opens avenues for more physically grounded 3D scene manipulation and suggests future exploration of additional physical factors and 2D feature alignment.
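As a rough illustration of the decomposition described above, the following sketch renders two per-sample feature sets along a single ray with standard NeRF-style volume rendering weights. The function names, the shared weights for both components, and the additive combination of the two rendered features are assumptions made for illustration; the summary does not specify how the paper composes the components.

```python
import numpy as np

def render_weights(sigma, delta):
    """NeRF-style volume rendering weights for the samples of one ray."""
    alpha = 1.0 - np.exp(-sigma * delta)  # per-sample opacity
    # Transmittance: probability the ray reaches each sample unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    return trans * alpha

def render_disentangled(sigma, delta, feat_indep, feat_dep):
    """Render the view-independent and view-dependent features separately.

    feat_indep, feat_dep: arrays of shape (num_samples, feature_dim).
    ASSUMPTION: both components share the same weights and are summed
    to form the full feature; this is a hypothetical composition.
    """
    w = render_weights(sigma, delta)
    f_indep = w @ feat_indep
    f_dep = w @ feat_dep
    return f_indep, f_dep, f_indep + f_dep
```

Under these assumptions, dropping the reflective part of a pixel's feature amounts to keeping only `f_indep`, and the same pattern would apply to a decomposed color output.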
Abstract
Recent work has demonstrated the ability to leverage or distill 2D features obtained from large pre-trained 2D models into 3D features, enabling impressive 3D editing and understanding capabilities using only 2D supervision. Although impressive, existing models assume that 3D features are captured using a single feature field and often make the simplifying assumption that features are view-independent. In this work, we propose instead to capture 3D features using multiple disentangled feature fields that represent different structural components of 3D features, namely view-dependent and view-independent components, which can be learned from 2D feature supervision only. Subsequently, each component can be controlled in isolation, enabling semantic and structural understanding and editing capabilities. For instance, using a user click, one can segment 3D features corresponding to a given object and then segment, edit, or remove its view-dependent (reflective) properties. We evaluate our approach on the task of 3D segmentation and demonstrate a set of novel understanding and editing tasks.
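The click-based segmentation mentioned above can be sketched as a similarity query against a rendered feature map. This is a minimal sketch under stated assumptions: cosine similarity, the `threshold` value, and the function name `segment_from_click` are illustrative choices, not details taken from the paper.

```python
import numpy as np

def segment_from_click(feat_map, click_yx, threshold=0.8):
    """Return a boolean mask of pixels whose feature resembles the clicked one.

    feat_map: rendered features of shape (height, width, feature_dim).
    click_yx: (row, col) of the user click.
    ASSUMPTION: cosine similarity with a fixed threshold is a hypothetical
    matching rule chosen for illustration.
    """
    norms = np.linalg.norm(feat_map, axis=-1, keepdims=True)
    unit = feat_map / (norms + 1e-8)  # unit-length feature per pixel
    query = unit[click_yx]            # feature at the clicked pixel
    similarity = unit @ query         # cosine similarity, shape (height, width)
    return similarity >= threshold
```

In a disentangled setting, the same query could in principle be run on the rendered view-independent map to select an object and on the view-dependent map to select its reflection, again as an assumption about usage rather than the authors' procedure.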
