DQFormer: Towards Unified LiDAR Panoptic Segmentation with Decoupled Queries
Yu Yang, Jianbiao Mei, Liang Liu, Siliang Du, Yilin Xiao, Jongwon Ra, Yong Liu, Xiao Xu, Huifeng Wu
TL;DR
DQFormer tackles LiDAR panoptic segmentation by decoupling things/stuff queries and disentangling classification from segmentation within a unified query-based framework. It introduces a multi-scale BEV-based query generator that produces semantic-aware queries for both things and stuff, and a query-oriented mask decoder that performs masked cross-attention to decode segmentation masks, which are then fused with query semantics to yield panoptic results. The approach uses a sparse voxel backbone with multi-scale voxel features, BEV embeddings, and heatmaps for object centers and stuff regions, coupled with deep supervision via the losses $L_{hm}$, $L_{mask}$, and $L_{sem}$. Extensive experiments on nuScenes and SemanticKITTI demonstrate state-of-the-art performance, with ablations confirming the effectiveness of decoupled queries, multi-scale fusion, and the mask-decoder design for handling small instances and diverse stuff classes, offering a streamlined and efficient alternative to multi-branch or two-stage pipelines.
Abstract
LiDAR panoptic segmentation, which jointly performs instance and semantic segmentation for things and stuff classes, plays a fundamental role in LiDAR perception tasks. While most existing methods explicitly separate these two segmentation tasks and utilize different branches (i.e., semantic and instance branches), some recent methods have embraced the query-based paradigm to unify LiDAR panoptic segmentation. However, the distinct spatial distributions and inherent characteristics of objects (things) and their surroundings (stuff) in 3D scenes lead to challenges, including the mutual competition between things and stuff and the ambiguity between classification and segmentation. In this paper, we propose decoupling things/stuff queries according to their intrinsic properties for individual decoding and disentangling classification/segmentation to mitigate ambiguity. To this end, we propose a novel framework dubbed DQFormer to implement semantic and instance segmentation in a unified workflow. Specifically, we design a decoupled query generator to propose informative queries with semantics by localizing things/stuff positions and fusing multi-level BEV embeddings. Moreover, a query-oriented mask decoder is introduced to decode corresponding segmentation masks by performing masked cross-attention between queries and mask embeddings. Finally, the decoded masks are combined with the semantics of the queries to produce panoptic results. Extensive experiments on the nuScenes and SemanticKITTI datasets demonstrate the superiority of our DQFormer framework.
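The last two steps of the pipeline (masked cross-attention between queries and mask embeddings, then fusing decoded masks with query semantics) can be illustrated with a minimal NumPy sketch. It is not the DQFormer implementation: the function names, the single-head attention without learned projections, the fallback for empty masks, and the per-point argmax used for fusion are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np


def masked_cross_attention(queries, mask_emb, attn_mask):
    """Single-head masked cross-attention (illustrative, no learned projections).

    queries:   (Q, C) query features
    mask_emb:  (N, C) per-point mask embeddings
    attn_mask: (Q, N) bool, True where a query is allowed to attend
    """
    scores = queries @ mask_emb.T / np.sqrt(queries.shape[1])  # (Q, N)
    # Assumption: a query with an empty mask attends to all points,
    # which avoids a softmax over an empty set.
    empty = ~attn_mask.any(axis=1, keepdims=True)
    allowed = attn_mask | empty
    scores = np.where(allowed, scores, -np.inf)
    # Numerically stable softmax over the point axis.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    # Residual update of the queries with the attended embeddings.
    return queries + weights @ mask_emb


def fuse_panoptic(mask_logits, query_classes, is_thing):
    """Combine decoded masks with query semantics (hypothetical argmax fusion).

    mask_logits:   (Q, N) mask score of each query at each point
    query_classes: (Q,) semantic class carried by each query
    is_thing:      (Q,) bool, True for things queries, False for stuff queries
    Returns per-point semantic labels and instance ids (0 means stuff).
    """
    owner = mask_logits.argmax(axis=0)  # winning query per point, shape (N,)
    semantic = query_classes[owner]
    # Things queries get instance ids 1, 2, ...; stuff queries share id 0.
    instance_ids = np.where(is_thing, np.cumsum(is_thing), 0)
    instance = instance_ids[owner]
    return semantic, instance
```

In this sketch the mask logits would typically come from a dot product between the updated queries and the mask embeddings, for example `masked_cross_attention(q, e, m) @ e.T`, which is also an assumption rather than something stated in the abstract. Because the semantic class is carried by the query itself instead of being predicted from the mask, classification and segmentation stay separate until the final fusion step.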
