HAECcity: Open-Vocabulary Scene Understanding of City-Scale Point Clouds with Superpoint Graph Clustering
Alexander Rusnak, Frédéric Kaplan
TL;DR
This work tackles open-vocabulary 3D scene understanding at city scale by developing HAECcity, a hierarchical, superpoint-graph clustering framework powered by a mixture-of-experts graph transformer. A key contribution is a fully synthetic, hand-annotation-free labeling pipeline that derives CLIP-based per-point features from multi-view renderings, enabling open-set pseudo-labels for panoptic segmentation directly on 3D data. The approach is validated on ScanNet and, notably, the SensatUrban city-scale dataset, demonstrating competitive open-vocabulary panoptic and semantic performance with significantly faster inference than reconstruction-based methods. Overall, HAECcity advances scalable 3D scene understanding for digital twins and large urban environments by combining open-vocabulary pseudo-labeling with an efficient, hierarchical 3D backbone.
Abstract
Traditional 3D scene understanding techniques generally rely on hand-annotated label sets, but in recent years a new class of open-vocabulary 3D scene understanding techniques has emerged. Despite the success of this paradigm on small scenes, existing approaches cannot scale efficiently to city-scale 3D datasets. In this paper, we present Hierarchical vocab-Agnostic Expert Clustering (HAEC), named after the Latin word for 'these', a superpoint graph clustering approach that uses a novel mixture-of-experts graph transformer as its backbone. We use this highly scalable approach to present the first application of open-vocabulary scene understanding to the SensatUrban city-scale dataset. We also demonstrate a synthetic labeling pipeline that is derived entirely from the raw point clouds, with no hand annotation. Our technique can help unlock complex operations on dense urban 3D scenes and open a new path forward in the processing of digital twins.
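To make the open-vocabulary labeling idea concrete, the following is a minimal sketch of how a per-point feature, fused from several rendered views, could be matched against text-query embeddings by cosine similarity. It is an illustration under stated assumptions and not the authors' pipeline: the function names `fuse_views` and `pseudo_label` are hypothetical, the averaging rule for fusing views is assumed, and the three-dimensional vectors are toy stand-ins for real CLIP image and text embeddings.

```python
import math


def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def fuse_views(view_features):
    """Fuse the features one 3D point receives from several rendered views.

    Assumption: views are fused by averaging their unit-normalized features.
    The source does not state which fusion rule is used.
    """
    unit = [normalize(f) for f in view_features]
    dim = len(unit[0])
    mean = [sum(f[i] for f in unit) / len(unit) for i in range(dim)]
    return normalize(mean)


def pseudo_label(point_feature, text_embeddings):
    """Return (query, score) for the text query most similar to the point."""
    p = normalize(point_feature)
    best_name, best_score = None, -math.inf
    for name, emb in text_embeddings.items():
        score = sum(a * b for a, b in zip(p, normalize(emb)))  # cosine similarity
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Toy stand-ins for text embeddings of free-form queries (hypothetical values).
text_embeddings = {
    "building": [1.0, 0.0, 0.0],
    "tree": [0.0, 1.0, 0.0],
    "road": [0.0, 0.0, 1.0],
}

# Each point is observed in two rendered views (hypothetical values).
point_a = fuse_views([[0.9, 0.1, 0.0], [0.8, 0.3, 0.1]])
point_b = fuse_views([[0.1, 0.9, 0.2], [0.0, 1.0, 0.0]])

label_a, score_a = pseudo_label(point_a, text_embeddings)
label_b, score_b = pseudo_label(point_b, text_embeddings)
print(label_a, label_b)
```

Because the label set is just a dictionary of query embeddings, it can be changed at inference time without retraining, which is the property that distinguishes open-vocabulary labeling from a fixed hand-annotated class list.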
