Mapping the Unseen: Unified Promptable Panoptic Mapping with Dynamic Labeling using Foundation Models
Mohamad Al Mdfaa, Raghad Salameh, Geesara Kulathunga, Sergey Zagoruyko, Gonzalo Ferrer
TL;DR
Open-vocabulary labels often fragment panoptic 3D representations, which hinders both semantic richness and geometric consistency. The authors introduce Unified Promptable Panoptic Mapping (UPPM), which builds a dynamic descriptor for each object: the descriptor aggregates open-vocabulary cues, maps them to a unified category with a size prior, and is fused into a multi-resolution TSDF map without per-scene training. Core contributions include Visual-Linguistic Feature Extraction++ (VLFE++), Semantic Retrieval, open-vocabulary segmentation with a custom non-maximum suppression (NMS) step, and Unified Panoptic Fusion, which maintains one descriptor per object across frames. Experiments on ScanNet v2, RIO, and Flat show gains in reconstruction accuracy and panoptic quality while enabling language-conditioned retrieval. Limitations include dependence on caption quality and computational overhead, which suggests future work on lighter labeling models and dynamic outdoor scenarios.
Abstract
Panoptic maps enable robots to reason about both geometry and semantics. However, open-vocabulary models repeatedly produce closely related labels that split panoptic entities and degrade volumetric consistency. The proposed UPPM advances open-world scene understanding by leveraging foundation models to introduce a panoptic Dynamic Descriptor that reconciles open-vocabulary labels with a unified category structure and geometric size priors. These dynamic descriptors are fused within a multi-resolution multi-TSDF map using language-guided open-vocabulary panoptic segmentation and semantic retrieval, resulting in a persistent and promptable panoptic map without additional model training. In our evaluation experiments, UPPM shows the best overall performance in terms of map reconstruction accuracy and panoptic segmentation quality. The ablation study investigates the contribution of each component of UPPM (custom NMS, blurry-frame filtering, and unified semantics) to the overall system performance. Consequently, UPPM preserves open-vocabulary interpretability while delivering strong geometric and panoptic accuracy.
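To make the idea of a dynamic descriptor concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hand-written lookup table (`UNIFIED`) in place of UPPM's semantic retrieval, and the class name `DynamicDescriptor`, its methods, the example labels, and the size-prior names are all hypothetical. It shows only how closely related open-vocabulary labels observed over several frames could be kept for interpretability while resolving to a single unified category with a size prior.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical table: open-vocabulary label -> (unified category, size prior).
# UPPM uses semantic retrieval for this step; a fixed lookup is a stand-in.
UNIFIED = {
    "sofa": ("couch", "large"),
    "couch": ("couch", "large"),
    "loveseat": ("couch", "large"),
    "mug": ("cup", "small"),
    "coffee cup": ("cup", "small"),
}


@dataclass
class DynamicDescriptor:
    """One descriptor per mapped object, updated as new frames arrive."""

    # Counts of every open-vocabulary label seen for this object.
    labels: Counter = field(default_factory=Counter)

    def observe(self, label: str) -> None:
        """Record one open-vocabulary label from a new frame."""
        self.labels[label.strip().lower()] += 1

    def unified(self) -> tuple[str, str]:
        """Return the (category, size prior) pair with the most label votes."""
        votes: Counter = Counter()
        for label, count in self.labels.items():
            if label in UNIFIED:
                votes[UNIFIED[label]] += count
        if not votes:
            # Assumed fallback for labels that the table does not cover.
            return ("unknown", "medium")
        return votes.most_common(1)[0][0]


if __name__ == "__main__":
    descriptor = DynamicDescriptor()
    for frame_label in ["Sofa", "couch", "mug"]:
        descriptor.observe(frame_label)
    print(descriptor.unified())       # unified category and size prior
    print(sorted(descriptor.labels))  # original labels stay available
```

In this sketch, "sofa" and "couch" vote for the same unified entry, so the object is not split into two panoptic entities, and the raw labels remain stored on the descriptor. How the size prior would select a TSDF resolution is outside the scope of this example.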
