HAECcity: Open-Vocabulary Scene Understanding of City-Scale Point Clouds with Superpoint Graph Clustering

Alexander Rusnak, Frédéric Kaplan

TL;DR

This work tackles open-vocabulary 3D scene understanding at city scale by developing HAECcity, a hierarchical, superpoint-graph clustering framework powered by a mixture-of-experts graph transformer. A key contribution is a fully synthetic, hand-annotation-free labeling pipeline that derives CLIP-based per-point features from multi-view renderings, enabling open-set pseudo-labels for panoptic segmentation directly on 3D data. The approach is validated on ScanNet and, notably, the SensatUrban city-scale dataset, demonstrating competitive open-vocabulary panoptic and semantic performance with significantly faster inference than reconstruction-based methods. Overall, HAECcity advances scalable 3D scene understanding for digital twins and large urban environments by combining open-vocabulary pseudo-labeling with an efficient, hierarchical 3D backbone.

Abstract

Traditional 3D scene understanding techniques are generally predicated on hand-annotated label sets, but in recent years a new class of open-vocabulary 3D scene understanding techniques has emerged. Despite the success of this paradigm on small scenes, existing approaches cannot scale efficiently to city-scale 3D datasets. In this paper, we present Hierarchical vocab-Agnostic Expert Clustering (HAEC), named after the Latin word for 'these', a superpoint-graph-clustering-based approach that uses a novel mixture-of-experts graph transformer as its backbone. We apply this highly scalable approach in the first application of open-vocabulary scene understanding to the SensatUrban city-scale dataset. We also demonstrate a synthetic labeling pipeline that is derived entirely from the raw point clouds, with no hand annotation. Our technique can help unlock complex operations on dense urban 3D scenes and open a new path forward in the processing of digital twins.
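
As an illustration of how per-point features might be derived from multi-view renderings, the sketch below averages the image features of every pixel that a point projects to across views. This is a minimal sketch under stated assumptions, not the authors' pipeline: the averaging rule, the function name fuse_view_features, and the dictionary-based data layout are all hypothetical, and the per-pixel features are assumed to have already been produced by an image model such as OpenSeg.

```python
def fuse_view_features(num_points, views):
    """Average per-pixel features over all views in which a point is visible.

    Hypothetical data layout (an assumption, not taken from the paper):
      views is a list of (pixel_features, point_to_pixel) pairs, where
      pixel_features maps a pixel id to a feature vector, and
      point_to_pixel maps a point index to the pixel it projects to.
      Points that are not visible in a view are simply absent from the map.

    Returns one feature vector per point, or None for points seen in no view.
    """
    sums = [None] * num_points
    counts = [0] * num_points
    for pixel_features, point_to_pixel in views:
        for point_idx, pixel_id in point_to_pixel.items():
            feat = pixel_features[pixel_id]
            if sums[point_idx] is None:
                # First observation of this point: start the running sum.
                sums[point_idx] = list(feat)
            else:
                sums[point_idx] = [s + f for s, f in zip(sums[point_idx], feat)]
            counts[point_idx] += 1
    # Divide each running sum by the number of views that saw the point.
    return [
        [s / counts[i] for s in sums[i]] if counts[i] else None
        for i in range(num_points)
    ]
```

Points that no rendering observes stay unlabeled (None) in this sketch; how such points are handled in practice is a design choice that the text above does not specify.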

Paper Structure

This paper contains 11 sections, 6 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: A long-tail query for "The gate of King's College" - we are able to distinguish this singular, particular building despite the architectural similarity of the surrounding buildings and their association with this particular college in Cambridge.
  • Figure 2: Hand-annotated instance labels on a scene from the ScanNet dataset.
  • Figure 3: Feature-clustering-based pseudo instance labels on the same ScanNet scene.
  • Figure 4: Comparison of estimated processing speed for a scene with 15k precomputed images, by the image annotation method used. The grey section of the bars represents the time for the image-to-point projection mapping computations. HAEC inference occurs solely on the 3D data of the same scene, whereas these slow preprocessing steps must be computed for every novel scene when using a reconstruction-based approach. The bar for OpenSeg is representative of the preprocessing time for training samples in our approach once the synthetic images are captured.
  • Figure 5: A query for "a red car" in one of the Birmingham scenes. Red indicates a higher similarity, and natural colors indicate that a point is below the classification threshold.
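
The thresholded query visualization described in the Figure 5 caption can be sketched as follows. This is an illustrative example only: it assumes each point already carries a feature vector in the same embedding space as the text query, and the function names, the cosine similarity measure, the default threshold of 0.5, and the red-intensity color mapping are all assumptions rather than details confirmed by the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat zero vectors as matching nothing.
    return dot / (norm_a * norm_b)


def query_points(point_features, text_embedding, threshold=0.5):
    """Return a (similarity, is_match) pair for every point.

    The threshold value is a placeholder assumption, not a value from the paper.
    """
    results = []
    for feat in point_features:
        sim = cosine_similarity(feat, text_embedding)
        results.append((sim, sim >= threshold))
    return results


def colorize(results, natural_colors):
    """Color matched points red (brighter for higher similarity).

    Points below the threshold keep their natural RGB color.
    """
    colored = []
    for (sim, is_match), rgb in zip(results, natural_colors):
        if is_match:
            red = min(1.0, max(0.0, sim))  # Clamp to a valid color channel.
            colored.append((red, 0.0, 0.0))
        else:
            colored.append(rgb)
    return colored
```

In use, text_embedding would come from the text encoder paired with the image model that produced the point features, so that the two live in a shared embedding space.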