VL-Fields: Towards Language-Grounded Neural Implicit Spatial Representations

Nikolaos Tsagkas, Oisin Mac Aodha, Chris Xiaoxuan Lu

TL;DR

VL-Fields introduces a neural implicit 3D representation that grounds open-vocabulary language queries by fusing scene geometry with vision-language embeddings. Building on NeRF-like rendering and multi-resolution hash encoding, it learns a 3D feature map that predicts density, color, and a CLIP embedding for each point. The model is trained with photometric, geometric, and visual-language losses and requires no predefined object classes. Across Replica scenes, VL-Fields outperforms CLIP-Fields and LSeg in open-vocabulary semantic segmentation, demonstrating the benefit of integrating geometry with language features via neural fields. The approach shows promise for robotics, enabling compact, view-consistent semantic maps that support open-set queries, though small-object detection and residual noise are left as areas for future improvement.

Abstract

We present Visual-Language Fields (VL-Fields), a neural implicit spatial representation that enables open-vocabulary semantic queries. Our model encodes and fuses the geometry of a scene with vision-language trained latent features by distilling information from a language-driven segmentation model. VL-Fields is trained without requiring any prior knowledge of the scene object classes, which makes it a promising representation for the field of robotics. Our model outperformed the similar CLIP-Fields model in the task of semantic segmentation by almost 10%.
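
The open-vocabulary query described above can be illustrated with a minimal sketch. It assumes each 3D point already carries a language-aligned feature vector and that each text prompt has been embedded into the same space; random vectors stand in for both here, and the function names (`normalize`, `query_labels`) are hypothetical, not taken from the paper. Each point is assigned to the prompt with the highest cosine similarity.

```python
import numpy as np

def normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def query_labels(point_features, text_embeddings):
    """Assign each point the index of its most similar text prompt.

    point_features:  (N, D) per-point features (assumed language-aligned)
    text_embeddings: (K, D) embeddings of K text prompts
    Returns (labels of shape (N,), similarity matrix of shape (N, K)).
    """
    sims = normalize(point_features) @ normalize(text_embeddings).T
    return sims.argmax(axis=1), sims

# Toy stand-ins: random vectors play the role of real text and point embeddings.
rng = np.random.default_rng(0)
text_embeddings = rng.normal(size=(3, 512))   # e.g. prompts "rug", "table", "plant"
true_labels = np.array([0, 2, 1, 1, 0])
# Each point feature is a noisy copy of the embedding of its true prompt.
point_features = text_embeddings[true_labels] + 0.1 * rng.normal(size=(5, 512))

labels, sims = query_labels(point_features, text_embeddings)
```

Because the prompts are only an input to the query, new categories can be added at query time without retraining the per-point features, which is what makes this style of lookup open-vocabulary.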

Paper Structure (12 sections, 3 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Our approach grounds open-vocabulary language-based queries in 3D space: "vacuum the rug", "clean the table", "pick up the plant", "dust the blinds". The colors indicate the areas in the encoded 3D space that correspond to each command.
  • Figure 2: Qualitative comparison between the ground-truth, LSeg, CLIP-Fields, and our VL-Fields semantic maps.
  • Figure 3: VL-Fields training pipeline.