Mask4Former: Mask Transformer for 4D Panoptic Segmentation

Kadir Yilmaz, Jonas Schult, Alexey Nekrasov, Bastian Leibe

TL;DR

Mask4Former is a transformer-based framework for 4D panoptic segmentation of LiDAR sequences that unifies semantic labeling and instance tracking without hand-crafted clustering. It uses spatio-temporal instance queries and an auxiliary 6-DOF bounding-box regression task to produce spatially compact, trackable predictions, reaching a state-of-the-art 68.4 LSTQ on the SemanticKITTI test set. The approach combines a sparse 4D backbone, cross-attention-based query refinement, and Hungarian matching with a joint mask and semantic loss, so that segmentation and tracking are learned end to end. The results show strong classification and association performance, which matters for robust autonomous navigation in dynamic environments.
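The matching step mentioned above can be illustrated with a small sketch. It is not the authors' implementation: the cost (a Dice term minus the predicted probability of the target class), the equal weighting of the two terms, and all function names are assumptions made for illustration. For brevity it finds the minimum-cost one-to-one assignment by brute force over permutations, which gives the same result as the Hungarian algorithm on tiny inputs but does not scale.

```python
from itertools import permutations

def dice_cost(pred_mask, gt_mask):
    """1 - Dice overlap between a soft predicted mask and a binary target mask."""
    inter = sum(p * g for p, g in zip(pred_mask, gt_mask))
    denom = sum(pred_mask) + sum(gt_mask)
    return 1.0 - (2.0 * inter + 1.0) / (denom + 1.0)  # +1 smoothing avoids 0/0

def match_queries(pred_masks, pred_probs, gt_masks, gt_labels):
    """Return (total_cost, assignment); assignment[t] is the query matched to target t.

    Assumes there are at least as many queries as targets (hypothetical setup).
    """
    n_q, n_t = len(pred_masks), len(gt_masks)
    # Joint cost: mask disagreement minus confidence in the target's class.
    cost = [[dice_cost(pred_masks[q], gt_masks[t]) - pred_probs[q][gt_labels[t]]
             for t in range(n_t)] for q in range(n_q)]
    best = None
    # Brute-force stand-in for the Hungarian algorithm (small inputs only).
    for perm in permutations(range(n_q), n_t):
        total = sum(cost[q][t] for t, q in enumerate(perm))
        if best is None or total < best[0]:
            best = (total, perm)
    return best

# Toy example: 3 queries, 2 ground-truth tracklets, 4 points, 2 classes.
pred_masks = [[0.9, 0.8, 0.1, 0.0], [0.1, 0.0, 0.9, 0.9], [0.2, 0.2, 0.2, 0.2]]
pred_probs = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
gt_masks = [[0, 0, 1, 1], [1, 1, 0, 0]]
gt_labels = [0, 1]
total_cost, assignment = match_queries(pred_masks, pred_probs, gt_masks, gt_labels)
print(assignment)  # (1, 0): target 0 goes to query 1, target 1 to query 0
```

Query 2 stays unmatched in this toy case; in mask-transformer training, such leftover queries are typically supervised to predict a "no object" class.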

Abstract

Accurately perceiving and tracking instances over time is essential for the decision-making processes of autonomous agents interacting safely in dynamic environments. With this intention, we propose Mask4Former for the challenging task of 4D panoptic segmentation of LiDAR point clouds. Mask4Former is the first transformer-based approach unifying semantic instance segmentation and tracking of sparse and irregular sequences of 3D point clouds into a single joint model. Our model directly predicts semantic instances and their temporal associations without relying on hand-crafted non-learned association strategies such as probabilistic clustering or voting-based center prediction. Instead, Mask4Former introduces spatio-temporal instance queries that encode the semantic and geometric properties of each semantic tracklet in the sequence. In an in-depth study, we find that promoting spatially compact instance predictions is critical as spatio-temporal instance queries tend to merge multiple semantically similar instances, even if they are spatially distant. To this end, we regress 6-DOF bounding box parameters from spatio-temporal instance queries, which are used as an auxiliary task to foster spatially compact predictions. Mask4Former achieves a new state-of-the-art on the SemanticKITTI test set with a score of 68.4 LSTQ.
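To make the compactness argument concrete, the sketch below computes a box target from the points of one tracklet and shows how the regression error grows when a query merges two distant cars. It rests on assumptions not stated in this text: the six parameters are taken to be the center and extent of an axis-aligned box over the superimposed points, the loss is a plain L1 distance, and the function names and coordinates are made up for illustration.

```python
def box_from_points(points):
    """Axis-aligned box (cx, cy, cz, sx, sy, sz) enclosing a list of (x, y, z) points."""
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    center = [(lo + hi) / 2.0 for lo, hi in zip(mins, maxs)]
    size = [hi - lo for lo, hi in zip(mins, maxs)]
    return tuple(center + size)

def l1_box_loss(pred_box, target_box):
    """Mean absolute error over the six box parameters (assumed loss form)."""
    return sum(abs(p - t) for p, t in zip(pred_box, target_box)) / len(target_box)

# Two cars of the same class, 30 m apart (hypothetical coordinates in meters).
car_a = [(0.0, 0.0, 0.0), (4.0, 2.0, 1.5)]
car_b = [(30.0, 0.0, 0.0), (34.0, 2.0, 1.5)]

box_a = box_from_points(car_a)               # (2.0, 1.0, 0.75, 4.0, 2.0, 1.5)
merged_box = box_from_points(car_a + car_b)  # (17.0, 1.0, 0.75, 34.0, 2.0, 1.5)

# A query that covers both cars cannot match either car's box well.
print(l1_box_loss(merged_box, box_a))  # 7.5
```

A mask loss alone penalizes the merged prediction only through the extra points it covers, whereas the box term in this sketch grows with the distance between the merged objects. That is one way to read the claim that box regression fosters spatially compact predictions.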

Paper Structure (7 sections, 3 equations, 4 figures, 6 tables)

This paper contains 7 sections, 3 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Spatially non-compact instances. Naively adapted for 4D panoptic segmentation, mask transformer approaches reveal a crucial shortcoming: instance predictions tend to be spatially non-compact. As a result, the baseline model predicts two cars as a single object (left). To overcome this limitation, we introduce Mask4Former, which additionally regresses 6-DOF bounding box parameters for the instance trajectory. We find that optimizing these bounding box parameters provides a valuable loss signal that promotes spatially compact instances (right).
  • Figure 2: Illustration of the Mask4Former model. We superimpose a sequence of $T$ point clouds into a spatio-temporal representation which is subsequently processed by a sparse convolutional feature backbone. Given a multi-scale feature representation extracted from the feature backbone, the transformer decoder iteratively refines spatio-temporal (ST) instance queries. A mask module consumes ST queries and point features at various scales and predicts semantic class probabilities, instance heatmaps, and a 6-DOF bounding box for each ST query.
  • Figure 3: Visualization of learned point representations. We use PCA to project the learned point representation of instances into RGB space. Our model trained without bounding box supervision exhibits reduced variance in its feature representation for instances. In contrast, Mask4Former effectively separates distinct instances in the feature space.
  • Figure 4: Qualitative Results. We show color-coded instance tracks over 8 superimposed frames in a spatio-temporal point cloud and a failure where a pedestrian track is split due to an observation being outside of the LiDAR's field of view.
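
The superimposition step described in Figure 2 can be sketched as follows. This is a simplified illustration, not the paper's pipeline: it assumes each scan comes with an ego pose reduced to a yaw angle plus a translation, and that each point keeps its frame index as a fourth coordinate. The pose format and the function name are hypothetical.

```python
import math

def superimpose(scans, poses):
    """Merge T scans into one list of (x, y, z, t) points in a shared frame.

    scans: list of T lists of (x, y, z) points in each sensor frame.
    poses: list of T tuples (yaw, tx, ty, tz), an assumed simplified ego pose.
    """
    merged = []
    for t, (points, (yaw, tx, ty, tz)) in enumerate(zip(scans, poses)):
        c, s = math.cos(yaw), math.sin(yaw)
        for x, y, z in points:
            # Rotate about the vertical axis, then translate into the shared frame.
            merged.append((c * x - s * y + tx, s * x + c * y + ty, z + tz, t))
    return merged

# Two frames observing a point 1 m ahead of the sensor; the ego vehicle has
# moved 2 m along x and turned 90 degrees between the frames.
scans = [[(1.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)]]
poses = [(0.0, 0.0, 0.0, 0.0), (math.pi / 2, 2.0, 0.0, 0.0)]
cloud = superimpose(scans, poses)
print(len(cloud))  # 2 points, each tagged with its frame index
```

Because all frames share one coordinate system after this step, a single mask over the merged cloud covers an instance in every frame at once, which is what lets one query represent a whole tracklet.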