Geometry Meets Vision: Revisiting Pretrained Semantics in Distilled Fields
Zhiting Mei, Ola Shorinwa, Anirudha Majumdar
TL;DR
The paper investigates whether geometry-grounded pretrained semantics confer advantages over visual-only semantics in distilled radiance fields. It compares VGGT-derived visual-geometry features with DINO/CLIP-based visual features across geometry fidelity, semantic localization, and radiance-field inversion, and introduces SPINE, a framework for radiance-field inversion without an initial guess. It finds that geometry-grounded features encode finer geometric detail but do not consistently improve localization or pose estimation, while visual-only features remain more versatile for downstream tasks. The work highlights the need for self-supervised geometry grounding and improved strategies to effectively fuse geometric content with visual semantics in open-vocabulary robotics.
Abstract
Semantic distillation in radiance fields has spurred significant advances in open-vocabulary robot policies, e.g., in manipulation and navigation, founded on pretrained semantics from large vision models. While prior work has demonstrated the effectiveness of visual-only semantic features (e.g., DINO and CLIP) in Gaussian Splatting and neural radiance fields, the potential benefit of geometry grounding in distilled fields remains an open question. In principle, visual-geometry features seem very promising for spatial tasks such as pose estimation, prompting the question: Do geometry-grounded semantic features offer an edge in distilled fields? Specifically, we ask three critical questions. First, does spatial grounding produce higher-fidelity geometry-aware semantic features? We find that image features from geometry-grounded backbones contain finer structural details than their visual-only counterparts. Second, does geometry grounding improve semantic object localization? We observe no significant difference in this task. Third, does geometry grounding enable higher-accuracy radiance field inversion? Given the limitations of prior work and its lack of semantics integration, we propose SPINE, a novel framework for inverting radiance fields without an initial guess, consisting of two core components: coarse inversion using distilled semantics, and fine inversion using photometric optimization. Surprisingly, we find that pose estimation accuracy decreases with geometry-grounded features. Our results suggest that visual-only features offer greater versatility for a broader range of downstream tasks, although geometry-grounded features contain more geometric detail. Notably, our findings underscore the necessity of future research on effective strategies for geometry grounding that augment the versatility and performance of pretrained semantic features.
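The coarse-then-fine structure described for radiance field inversion can be illustrated with a toy sketch. This is not the SPINE implementation: the abstract does not specify how either stage is computed, so everything below is an assumption made for illustration. The "radiance field" is a 1D periodic signal rendered by the hypothetical `render`, the "distilled semantics" are block-averaged intensities from the hypothetical `features`, the coarse stage is assumed to pick the best of a fixed set of candidate poses by cosine similarity, and the fine stage is assumed to be gradient descent on a mean-squared photometric error.

```python
import numpy as np

def render(pose, xs):
    # Hypothetical stand-in for a radiance-field renderer:
    # a smooth 1D "image" that shifts with a scalar camera pose.
    return np.sin(xs + pose) + 0.5 * np.sin(3.0 * (xs + pose))

def features(image):
    # Hypothetical stand-in for distilled semantic features:
    # a low-resolution descriptor made of 8 block averages.
    return image.reshape(8, -1).mean(axis=1)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def coarse_inversion(target, xs, candidates):
    # Coarse stage (assumed form): no initial guess is needed; score every
    # candidate pose by feature similarity and keep the best one.
    target_feat = features(target)
    scores = [cosine(target_feat, features(render(p, xs))) for p in candidates]
    return float(candidates[int(np.argmax(scores))])

def fine_inversion(target, xs, pose, lr=0.05, steps=200, eps=1e-4):
    # Fine stage (assumed form): gradient descent on the photometric error,
    # with the gradient approximated by central finite differences.
    def loss(p):
        return float(np.mean((render(p, xs) - target) ** 2))
    for _ in range(steps):
        grad = (loss(pose + eps) - loss(pose - eps)) / (2.0 * eps)
        pose -= lr * grad
    return pose

# Usage: recover an unknown pose from a single observed "image".
xs = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
observed = render(1.23, xs)
candidates = np.linspace(-np.pi, np.pi, 16, endpoint=False)
coarse_pose = coarse_inversion(observed, xs, candidates)
refined_pose = fine_inversion(observed, xs, coarse_pose)
```

In this toy setting the coarse stage only needs to land within the basin of attraction of the photometric loss. The fine stage then supplies the accuracy, which is the division of labor the two-component description suggests.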
