Mapping the Unseen: Unified Promptable Panoptic Mapping with Dynamic Labeling using Foundation Models

Mohamad Al Mdfaa, Raghad Salameh, Geesara Kulathunga, Sergey Zagoruyko, Gonzalo Ferrer

TL;DR

Open-vocabulary labels often fragment panoptic 3D representations, hindering both semantic richness and geometric consistency. The authors introduce Unified Promptable Panoptic Mapping (UPPM), which uses dynamic descriptors that aggregate open-vocabulary cues for each object, map them to a unified category with a size prior, and fuse them into a multi-resolution TSDF map without per-scene training. Core contributions include Visual-Linguistic Feature Extraction++ (VLFE++), Semantic Retrieval, open-vocabulary segmentation with a custom NMS, and Unified Panoptic Fusion, which maintains one descriptor per object across frames. Experiments on ScanNet v2, RIO, and Flat demonstrate gains in reconstruction accuracy and panoptic quality while enabling language-conditioned retrieval. Limitations include dependence on caption quality and computational overhead, suggesting future work on lighter labeling models and dynamic outdoor scenarios.

Abstract

Panoptic maps enable robots to reason about both geometry and semantics. However, open-vocabulary models repeatedly produce closely related labels that split panoptic entities and degrade volumetric consistency. The proposed UPPM advances open-world scene understanding by leveraging foundation models to introduce a panoptic Dynamic Descriptor that reconciles open-vocabulary labels with a unified category structure and geometric size priors. These dynamic descriptors are fused within a multi-resolution multi-TSDF map using language-guided open-vocabulary panoptic segmentation and semantic retrieval, resulting in a persistent and promptable panoptic map without additional model training. In our evaluation experiments, UPPM shows the best overall performance in terms of map reconstruction accuracy and panoptic segmentation quality. The ablation study investigates the contribution of each component of UPPM (custom NMS, blurry-frame filtering, and unified semantics) to the overall system performance. Consequently, UPPM preserves open-vocabulary interpretability while delivering strong geometric and panoptic accuracy.

Paper Structure (24 sections, 5 equations, 8 figures, 8 tables)

Figures (8)

  • Figure S1: Overview of Unified Promptable Panoptic Mapping (UPPM). Dynamic descriptors aggregate open-vocabulary labels, unified semantic categories, and size priors and are fused into a multi-resolution multi-TSDF map that preserves panoptic consistency while remaining queryable with natural-language prompts.
  • Figure S2: Language-conditioned object retrieval via dynamic descriptors. Different textual queries, including synonyms and paraphrases, consistently retrieve the same 3D object because each submap stores all accumulated open-vocabulary labels alongside a unified category and size prior.
  • Figure S3: System overview of Unified Promptable Panoptic Mapping (UPPM). RGB frames are processed by VLFE++ to generate caption- and tag-based open-vocabulary labels. Semantic Retrieval (SR) maps these labels to unified semantic categories and size priors, forming elementary descriptors $e_i = \langle o_i, E(o_i), \hat{c}_i, \hat{s}_i \rangle$. Open-vocabulary promptable panoptic segmentation consumes these descriptors, together with custom NMS, to produce semantic and instance segmentations. Unified panoptic fusion then fuses geometry and elementary descriptors into a multi-resolution multi-TSDF representation with one dynamic descriptor per object, producing the unified promptable panoptic map (UPPM) $\mathcal{M} = (\mathcal{M}_g \bigoplus \mathcal{M}_d)$.
  • Figure S4: Panoptic dynamic descriptors unify open-vocabulary labels over time. As observations accumulate across frames, the descriptor remains attached to the same object in the panoptic map, preserving promptable yet category-consistent semantics.
  • Figure S5: Qualitative comparison on the RIO dataset (Wald et al., 2019). Top: Ground-truth panoptic segmentation and corresponding 3D map. Bottom: UPPM output with sharper object boundaries and more coherent semantic regions in cluttered, real-world scenes.
  • ...and 3 more figures
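
The captions for Figures S2 to S4 describe a dynamic descriptor that accumulates open-vocabulary labels per object while keeping one unified category and size prior, so that synonyms retrieve the same object. The idea can be illustrated with a minimal sketch. It is not the authors' implementation: the class names, the `retrieve` helper, the string-valued size priors, and the demo objects are all hypothetical, the term $E(o_i)$ from the Figure S3 caption is omitted, and exact string matching stands in for whatever matching the real system uses.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ElementaryDescriptor:
    """Per-frame observation of one object (hypothetical simplification)."""
    label: str        # open-vocabulary label o_i, e.g. "couch"
    category: str     # unified category, e.g. "sofa"
    size_prior: str   # size prior; string values here are an assumption


@dataclass
class DynamicDescriptor:
    """One descriptor per object, accumulated across frames."""
    category: str
    size_prior: str
    labels: Set[str] = field(default_factory=set)

    def update(self, obs: ElementaryDescriptor) -> None:
        # Keep every open-vocabulary label seen so far; the unified
        # category and size prior stay fixed for the object.
        self.labels.add(obs.label.strip().lower())

    def matches(self, query: str) -> bool:
        # Assumption: exact match on the category or any accumulated label.
        q = query.strip().lower()
        return q == self.category.lower() or q in self.labels


def retrieve(submaps: Dict[int, DynamicDescriptor], query: str) -> List[int]:
    """Return the ids of submaps whose descriptor matches the text query."""
    return sorted(i for i, d in submaps.items() if d.matches(query))


def build_demo_map() -> Dict[int, DynamicDescriptor]:
    """Build a two-object map from made-up per-frame observations."""
    observations = [
        (1, ElementaryDescriptor("couch", "sofa", "large")),
        (1, ElementaryDescriptor("loveseat", "sofa", "large")),
        (2, ElementaryDescriptor("armchair", "chair", "medium")),
        (2, ElementaryDescriptor("office chair", "chair", "medium")),
    ]
    submaps: Dict[int, DynamicDescriptor] = {}
    for obj_id, obs in observations:
        if obj_id not in submaps:
            submaps[obj_id] = DynamicDescriptor(obs.category, obs.size_prior)
        submaps[obj_id].update(obs)
    return submaps


if __name__ == "__main__":
    demo = build_demo_map()
    for q in ("sofa", "Couch", "loveseat", "armchair"):
        print(q, "->", retrieve(demo, q))
```

In this toy map, the queries "sofa", "Couch", and "loveseat" all resolve to the same object id because the descriptor keeps every label it has seen, which mirrors the behavior described in the Figure S2 caption.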