DQFormer: Towards Unified LiDAR Panoptic Segmentation with Decoupled Queries

Yu Yang, Jianbiao Mei, Liang Liu, Siliang Du, Yilin Xiao, Jongwon Ra, Yong Liu, Xiao Xu, Huifeng Wu

TL;DR

DQFormer tackles LiDAR panoptic segmentation by decoupling things/stuff queries and disentangling classification from segmentation within a unified query-based framework. It introduces a multi-scale BEV-based query generator that produces semantic-aware queries for both things and stuff, and a query-oriented mask decoder that performs masked cross-attention to decode segmentation masks, which are then fused with query semantics to yield panoptic results. The approach uses a sparse voxel backbone with multi-scale voxel features, BEV embeddings, and heatmaps for object centers and stuff regions, trained with deep supervision via $L_{hm}$, $L_{mask}$, and $L_{sem}$. Extensive experiments on nuScenes and SemanticKITTI demonstrate state-of-the-art performance. Ablations confirm the effectiveness of decoupled queries, multi-scale fusion, and the mask-decoder design for handling small instances and diverse stuff classes, and the unified design offers a streamlined and efficient alternative to multi-branch or two-stage pipelines.

Abstract

LiDAR panoptic segmentation, which jointly performs instance and semantic segmentation for things and stuff classes, plays a fundamental role in LiDAR perception tasks. While most existing methods explicitly separate these two segmentation tasks and utilize different branches (i.e., semantic and instance branches), some recent methods have embraced the query-based paradigm to unify LiDAR panoptic segmentation. However, the distinct spatial distribution and inherent characteristics of objects (things) and their surroundings (stuff) in 3D scenes lead to challenges, including the mutual competition of things/stuff and the ambiguity of classification/segmentation. In this paper, we propose decoupling things/stuff queries according to their intrinsic properties for individual decoding and disentangling classification/segmentation to mitigate ambiguity. To this end, we propose a novel framework dubbed DQFormer to implement semantic and instance segmentation in a unified workflow. Specifically, we design a decoupled query generator to propose informative queries with semantics by localizing things/stuff positions and fusing multi-level BEV embeddings. Moreover, a query-oriented mask decoder is introduced to decode corresponding segmentation masks by performing masked cross-attention between queries and mask embeddings. Finally, the decoded masks are combined with the semantics of the queries to produce panoptic results. Extensive experiments on nuScenes and SemanticKITTI datasets demonstrate the superiority of our DQFormer framework.
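
The final step of the abstract, combining decoded masks with query semantics, can be illustrated with a minimal sketch. The abstract does not specify the fusion rule, so the per-point argmax over query mask scores used here is an assumption, and the names `fuse_panoptic`, `mask_scores`, `query_classes`, and `query_is_thing` are hypothetical.

```python
def fuse_panoptic(mask_scores, query_classes, query_is_thing):
    """Fuse per-query masks with per-query semantics (illustrative only).

    mask_scores[q][p] is the score of point p under query q.
    Assumption: each point is assigned to its highest-scoring query.
    """
    num_queries = len(mask_scores)
    num_points = len(mask_scores[0]) if num_queries else 0

    # Give every things query its own instance id (1, 2, ...);
    # stuff queries share instance id 0.
    thing_ids = {}
    for q in range(num_queries):
        if query_is_thing[q]:
            thing_ids[q] = len(thing_ids) + 1

    semantic, instance = [], []
    for p in range(num_points):
        best_q = max(range(num_queries), key=lambda q: mask_scores[q][p])
        semantic.append(query_classes[best_q])      # class comes from the query
        instance.append(thing_ids.get(best_q, 0))   # id only for things
    return semantic, instance


# Toy example: two things queries ("car") and one stuff query ("road").
scores = [
    [0.9, 0.8, 0.1, 0.2],
    [0.1, 0.3, 0.7, 0.1],
    [0.2, 0.1, 0.3, 0.9],
]
semantic, instance = fuse_panoptic(
    scores, ["car", "car", "road"], [True, True, False]
)
```

In this sketch the class label is never predicted per point; it is inherited from the winning query, which mirrors the idea of keeping classification (query semantics) separate from segmentation (mask decoding).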

Paper Structure (28 sections, 9 equations, 10 figures, 12 tables)

Figures (10)

  • Figure 1: Distinction between things and stuff in LiDAR scenes: Instances with similar geometric properties are typically concentrated in local regions, whereas widely distributed stuff regions with many points exhibit distinct geometries.
  • Figure 2: (a) Existing semantic/instance separation paradigm. (b) Existing learnable query-based methods ignore the distinctions between things and stuff. (c) We decouple things/stuff queries and mitigate competition between classification/segmentation for unified LiDAR panoptic segmentation.
  • Figure 3: Overview of DQFormer. (a) The feature encoder is applied to extract voxel features and point embeddings at multiple resolutions. (b) The query generator is designed to produce informative things/stuff queries with assigned semantics according to their positions and embeddings in BEV space. (c) The mask decoder performs masked cross-attention between queries and multi-level point embeddings to decode segmentation masks. Finally, the decoded masks are combined with the semantics of the queries to produce the panoptic result. Details of the decoder block are illustrated in Figure 5.
  • Figure 4: Details of query proposal generation: Things queries are extracted from the BEV embedding at the corresponding positions. Stuff queries are generated using the learnable-query approach within the BEV space.
  • Figure 5: Detailed pipeline of the decoder block: consisting of masked cross-attention, self-attention, and a feed-forward network.
  • ...and 5 more figures
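
The things-query proposal described in the Figure 4 caption (queries read from the BEV embedding at the positions of detected things) can be sketched as follows. This is a rough illustration, not the paper's implementation: locating positions as thresholded 3x3 local maxima of a center heatmap is an assumption borrowed from common center-based detectors, and `extract_thing_queries`, `threshold`, and `max_queries` are hypothetical names.

```python
def extract_thing_queries(heatmap, bev_embedding, threshold=0.5, max_queries=100):
    """Pick heatmap peaks and gather BEV embeddings there (illustrative only).

    heatmap[y][x] is a center score; bev_embedding[y][x] is a feature vector.
    Assumption: a peak is a cell >= threshold and >= all of its 8 neighbors.
    """
    height, width = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            score = heatmap[y][x]
            if score < threshold:
                continue
            neighbors = [
                heatmap[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            ]
            if all(score >= n for n in neighbors):
                peaks.append((score, y, x))

    # Keep the strongest peaks first, capped at max_queries.
    peaks.sort(reverse=True)
    positions = [(y, x) for _, y, x in peaks[:max_queries]]
    # Each things query is the BEV embedding at a peak position.
    queries = [list(bev_embedding[y][x]) for y, x in positions]
    return positions, queries


# Toy 3x4 BEV grid whose embedding at (y, x) is simply [y, x].
toy_heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.2, 0.1],
    [0.1, 0.2, 0.1, 0.7],
]
toy_embedding = [[[float(y), float(x)] for x in range(4)] for y in range(3)]
positions, queries = extract_thing_queries(toy_heatmap, toy_embedding)
```

Stuff queries are not covered by this sketch; per the caption they are produced with a learnable-query approach in BEV space rather than by position lookup.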