Geometry Meets Vision: Revisiting Pretrained Semantics in Distilled Fields

Zhiting Mei, Ola Shorinwa, Anirudha Majumdar

TL;DR

The paper investigates whether geometry-grounded pretrained semantics offer advantages over visual-only semantics in distilled radiance fields. It compares VGGT-derived visual-geometry features with DINO/CLIP-based visual features on three axes: geometric fidelity, semantic object localization, and radiance-field inversion. It also introduces SPINE, a framework for inverting radiance fields without an initial guess. Geometry-grounded features encode finer geometric detail, but they yield no significant difference in localization and lower pose estimation accuracy, so visual-only features remain more versatile for downstream tasks. The work highlights the need for self-supervised geometry grounding and improved strategies to effectively fuse geometric content with visual semantics in open-vocabulary robotics.

Abstract

Semantic distillation in radiance fields has spurred significant advances in open-vocabulary robot policies, e.g., in manipulation and navigation, founded on pretrained semantics from large vision models. While prior work has demonstrated the effectiveness of visual-only semantic features (e.g., DINO and CLIP) in Gaussian Splatting and neural radiance fields, the potential benefit of geometry-grounding in distilled fields remains an open question. In principle, visual-geometry features seem very promising for spatial tasks such as pose estimation, prompting the question: Do geometry-grounded semantic features offer an edge in distilled fields? Specifically, we ask three critical questions. First, does spatial-grounding produce higher-fidelity geometry-aware semantic features? We find that image features from geometry-grounded backbones contain finer structural details compared to their visual-only counterparts. Second, does geometry-grounding improve semantic object localization? We observe no significant difference in this task. Third, does geometry-grounding enable higher-accuracy radiance field inversion? Given the limitations of prior work and their lack of semantics integration, we propose a novel framework SPINE for inverting radiance fields without an initial guess, consisting of two core components: coarse inversion using distilled semantics, and fine inversion using photometric-based optimization. Surprisingly, we find that the pose estimation accuracy decreases with geometry-grounded features. Our results suggest that visual-only features offer greater versatility for a broader range of downstream tasks, although geometry-grounded features contain more geometric detail. Notably, our findings underscore the necessity of future research on effective strategies for geometry-grounding that augment the versatility and performance of pretrained semantic features.
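
Semantic object localization, the task behind the second question, is often posed as a similarity query against the distilled features. The sketch below is an illustration under stated assumptions, not the paper's implementation: it assumes a per-pixel feature map and a query embedding that live in the same space, uses random arrays as stand-ins for real DINO/CLIP/VGGT features, and the function name `localize` is hypothetical.

```python
import numpy as np

def localize(feature_map, query, eps=1e-8):
    """Return the (row, col) of the pixel whose feature is most similar
    to the query, together with the full cosine-similarity map.

    feature_map: array of shape (H, W, D), assumed per-pixel features.
    query:       array of shape (D,), assumed to share the feature space.
    """
    # Normalize both sides so the dot product is a cosine similarity.
    f = feature_map / (np.linalg.norm(feature_map, axis=-1, keepdims=True) + eps)
    q = query / (np.linalg.norm(query) + eps)
    sim = f @ q  # shape (H, W)
    row, col = np.unravel_index(np.argmax(sim), sim.shape)
    return (int(row), int(col)), sim

# Synthetic stand-in data (an assumption, not real backbone features).
rng = np.random.default_rng(0)
H, W, D = 8, 10, 16
features = rng.normal(size=(H, W, D))
query = rng.normal(size=D)

# Plant one pixel whose feature points in the same direction as the query.
features[5, 7] = 3.0 * query

peak, sim = localize(features, query)
print(peak)
```

In this toy setup the planted pixel has cosine similarity of approximately 1 with the query, so it is the one returned. With real distilled fields, the same query would run on features rendered from the field rather than on a random array.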

Paper Structure

This paper contains 18 sections, 1 equation, 59 figures.

Figures (59)

  • Figure 1: We revisit pretrained semantics in distilled radiance fields, asking three critical questions to compare visual-geometry semantic features against visual-only features. We find that while visual-geometry features retain richer spatial fidelity, they do not improve performance in downstream tasks such as semantic localization or radiance field inversion, suggesting the greater versatility of visual-only semantic features.
  • Figure 2: (left) Semantic distillation architecture, showing co-supervision of CLIP with DINO/VGGT via the base semantics module. (right) VGGT's semantic embeddings from different heads, showing the high-fidelity geometric content of the point head.
  • Figure 3: Semantic content of distilled features. Whereas visual-only features provide object-level information, visual-geometry features provide more structural details, such as an object's contour.
  • Figure 4: Geometric fidelity factor (GFF) of visual-geometry and visual-only features. VGGT's features contain prominent object edges, unlike visual-only semantic features.
  • Figure 5: Semantic object localization. Both visual-only features (DINOv2/DINOv3) and visual-geometry features (VGGT) achieve similar localization accuracies (Teatime scene visuals).
  • ...and 54 more figures