
DISC: Dense Integrated Semantic Context for Large-Scale Open-Set Semantic Mapping

Felix Igelbrink, Lennart Niecksch, Martin Atzmueller, Joachim Hertzberg

TL;DR

DISC (Dense Integrated Semantic Context) introduces a single-pass, distance-weighted feature extraction mechanism and surpasses current state-of-the-art zero-shot methods in both semantic accuracy and query retrieval, providing a robust, real-time-capable framework for robotic deployment.

Abstract

Open-set semantic mapping enables language-driven robotic perception, but current instance-centric approaches are bottlenecked by context-depriving and computationally expensive crop-based feature extraction. To overcome this fundamental limitation, we introduce DISC (Dense Integrated Semantic Context), featuring a novel single-pass, distance-weighted extraction mechanism. By deriving high-fidelity CLIP embeddings directly from the vision transformer's intermediate layers, our approach eliminates the latency and domain-shift artifacts of traditional image cropping, yielding pure, mask-aligned semantic representations. To fully leverage these features in large-scale continuous mapping, DISC is built upon a fully GPU-accelerated architecture that replaces periodic offline processing with precise, on-the-fly voxel-level instance refinement. We evaluate our approach on standard benchmarks (Replica, ScanNet) and a newly generated large-scale-mapping dataset based on Habitat-Matterport 3D (HM3DSEM) to assess scalability across complex scenes in multi-story buildings. Extensive evaluations demonstrate that DISC significantly surpasses current state-of-the-art zero-shot methods in both semantic accuracy and query retrieval, providing a robust, real-time capable framework for robotic deployment. The full source code, data generation and evaluation pipelines will be made available at https://github.com/DFKI-NI/DISC.

Paper Structure (20 sections, 3 equations, 5 figures, 5 tables)


Figures (5)

  • Figure 1: Example mapping results on an HM3D scene. Top: The tracked instances of the map, randomly colored. Bottom: The resulting semantic segmentation with the $\text{top}_k$ semantic classes.
  • Figure 2: Visualization of our single-pass dense feature extraction. (a) The original RGB input frame. (b) Weighted dense cosine similarity heatmap for the open-vocabulary query "an image of a chair". The extracted patch features provide precise semantic grounding without requiring image crops.
  • Figure 3: Generated trajectory from HM3D scene 00800. Red: navigation mesh. Blue lines: generated trajectory. Teal points: coverage analysis from ray tracing, shown as a downsampled voxel grid.
  • Figure 4: (a) Ratio of covered objects ($Area>50\%$ of surface covered) and ratio of covered surface for our generated trajectories for HM3D. The high variance arises because the largest connected component of the Habitat navigation mesh covers the scene only partially (Largest Island/Total Area). (b) Tour length plotted in relation to the navigable area, with color encoding the resulting dataset size (= simulation steps required).
  • Figure 5: Performance of the mapping pipeline on the exemplary HM3D scene 00849.
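
The dense similarity heatmap described in the Figure 2 caption can be illustrated with a minimal sketch. It computes plain cosine similarity between each patch feature in a grid and a single query embedding. The function names, the toy 2x2 grid, and the 3-dimensional vectors are assumptions chosen for illustration; the sketch does not reproduce the paper's distance weighting or its CLIP feature extraction, neither of which is specified in this summary.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_heatmap(patch_features, query_embedding):
    """Score every patch in a (rows x cols) grid of feature vectors against one query.

    Hypothetical helper: in a real pipeline the patch features would come from
    a vision encoder and the query embedding from the matching text encoder.
    """
    return [
        [cosine_similarity(feature, query_embedding) for feature in row]
        for row in patch_features
    ]


# Toy 2x2 patch grid with 3-dimensional features (illustrative values only).
patch_features = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[1.0, 1.0, 0.0], [-1.0, 0.0, 0.0]],
]
query_embedding = [2.0, 0.0, 0.0]

heatmap = similarity_heatmap(patch_features, query_embedding)
for row in heatmap:
    print([round(value, 3) for value in row])
```

Because cosine similarity ignores vector magnitude, the query `[2.0, 0.0, 0.0]` scores the same as a unit-length query would; patches aligned with the query score near 1, orthogonal ones near 0, and opposed ones near -1.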