OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting

Hongjia Zhai, Qi Zhang, Xiaokun Pan, Xiyu Zhang, Yitong Dong, Huaqi Zhang, Dan Xu, Guofeng Zhang

Abstract

Open-vocabulary scene understanding with online panoptic mapping is essential for embodied applications that must perceive and interact with their environments. However, existing methods are predominantly offline or lack instance-level understanding, which limits their applicability to real-world robotic tasks. In this paper, we propose OnlinePG, a novel and effective system that integrates geometric reconstruction and open-vocabulary perception using 3D Gaussian Splatting in an online setting. To achieve online panoptic mapping, we employ an efficient local-to-global paradigm with a sliding window. To build a locally consistent map, we construct a 3D segment clustering graph that jointly leverages geometric and semantic cues, fusing inconsistent segments within the sliding window into complete instances. Subsequently, to update the global map, we construct explicit grids with spatial attributes for the local 3D Gaussian map and fuse them into the global map via robust bidirectional bipartite 3D Gaussian instance matching. Finally, we use the fused VLM features inside the 3D spatial attribute grids to achieve open-vocabulary scene understanding. Extensive experiments on widely used datasets demonstrate that our method achieves better performance than other online approaches while maintaining real-time efficiency.
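The local-to-global fusion step can be illustrated with a minimal sketch. The abstract does not spell out the matching criterion, so the code below swaps in a simple mutual-best-match check on voxel IoU as a stand-in: a local instance and a global instance are paired only if each picks the other as its best overlap. The function names (`voxel_iou`, `mutual_match`), the set-of-voxel-indices representation, and the `min_iou` threshold are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def voxel_iou(a: set, b: set) -> float:
    """IoU between two sets of occupied voxel indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def mutual_match(local_instances, global_instances, min_iou=0.3):
    """Return {local_idx: global_idx} for pairs that choose each other.

    Hypothetical stand-in for bidirectional matching: a pair is kept only
    if it is the best match in both directions and overlaps enough.
    """
    if not local_instances or not global_instances:
        return {}
    iou = np.array(
        [[voxel_iou(l, g) for g in global_instances] for l in local_instances]
    )
    best_global = iou.argmax(axis=1)  # forward pass: local -> global
    best_local = iou.argmax(axis=0)   # backward pass: global -> local
    matches = {}
    for i, j in enumerate(best_global):
        if best_local[j] == i and iou[i, j] >= min_iou:
            matches[i] = int(j)
    return matches
```

Under this sketch, a local instance left without a match would be added to the global map as a new instance, while a matched pair would be merged.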

Paper Structure (13 sections, 12 equations, 11 figures, 5 tables)


Figures (11)

  • Figure 1: Illustration of OnlinePG, which integrates geometric reconstruction and open-vocabulary panoptic perception built upon 3D Gaussian Splatting. Given a posed video stream and noisy 2D priors (masks and features) generated by vision-language models (VLMs), OnlinePG reconstructs a consistent 3D language feature field and performs online open-vocabulary panoptic mapping.
  • Figure 2: Overview of Our Approach. Our system performs online open-vocabulary panoptic mapping from RGB-D streams using a local-to-global paradigm. (a) A sliding window $\mathcal{W}$ is maintained to build locally consistent instances and fuse them globally. (b) Noisy 2D masks and features from CropFormer [cropformer] and LSeg [Lseg] initialize 3D Gaussian segments, which are clustered using multiple cues to form consistent 3D instances. (c) Bidirectional bipartite matching incrementally fuses local instances and spatial attributes into the global map.
  • Figure 3: Qualitative 3D Semantic Segmentation Comparison on the ScanNetV2 Dataset. Our approach outperforms recent online approaches, O2V-Mapping [o2v-mapping_eccv24] and OnlineAnySeg [tang2025onlineanyseg], by a large margin. Compared with the offline SOTA PanoGS [zhai_cvpr25_panogs], a small performance gap remains.
  • Figure 4: Qualitative Results of Open-Vocabulary Query. Different colors distinguish the instances returned by each query. Compared to OnlineAnySeg [tang2025onlineanyseg], our approach shows more advanced text query capabilities in long-tail and multi-instance scenarios.
  • Figure 5: Left: ablation study of different segment clustering cues for local map construction. Right: ablation study of different feature grid resolutions. Results are evaluated on the ScanNetV2 dataset.
  • ...and 6 more figures
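
The open-vocabulary queries shown in Figure 4 can be pictured with a small sketch. It assumes each mapped instance carries one fused feature vector and that the text query has already been embedded into the same space by a VLM text encoder; instances are then ranked by cosine similarity to the query. The function name `query_instances`, the array layout, and the `threshold` value are hypothetical and are not taken from the paper.

```python
import numpy as np


def query_instances(instance_feats, text_feat, threshold=0.5):
    """Return indices of instances matching a text query, best match first.

    instance_feats: (N, D) array, one fused feature per instance (assumed).
    text_feat: (D,) embedding of the text query (assumed precomputed).
    """
    feats = np.asarray(instance_feats, dtype=float)
    text = np.asarray(text_feat, dtype=float)
    # Normalize so the dot product equals cosine similarity.
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12)
    text = text / (np.linalg.norm(text) + 1e-12)
    sims = feats @ text
    hits = np.flatnonzero(sims >= threshold)
    # Sort the surviving instances by descending similarity.
    return hits[np.argsort(-sims[hits])].tolist()
```

Returning every instance above the threshold, rather than only the top hit, is what lets a single query such as "chair" retrieve multiple instances.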