4D-CS: Exploiting Cluster Prior for 4D Spatio-Temporal LiDAR Semantic Segmentation

Jiexi Zhong, Zhiheng Li, Yubo Cui, Zheng Fang

TL;DR

The paper tackles the problem of inconsistent LiDAR point segmentation across space and time in autonomous driving contexts. It introduces 4D-CS, a dual-branch architecture that combines a point-based branch with multi-view temporal fusion and a cluster-based branch that uses DBSCAN-derived cluster labels and temporal cluster enhancement, followed by adaptive fusion to produce coherent predictions. Key contributions include explicit generation of cluster labels across frames, multi-view temporal fusion, temporal cluster enhancement, and an adaptive prediction fusion mechanism, achieving state-of-the-art results on SemanticKITTI and nuScenes for multi-scan semantic and moving-object segmentation. The approach improves segmentation integrity for large foreground objects and enhances motion-state estimation, with practical impact for robust autonomous perception and mapping.

Abstract

Semantic segmentation of LiDAR points has significant value for autonomous driving and mobile robot systems. Most approaches exploit spatio-temporal information from multiple scans to identify the semantic class and motion state of each point. However, these methods often overlook segmentation consistency in space and time, which may result in points within the same object being predicted as different categories. To handle this issue, our core idea is to generate cluster labels across multiple frames that reflect the complete spatial structure and temporal information of objects. These labels serve as explicit guidance for our dual-branch network, 4D-CS, which integrates point-based and cluster-based branches to enable more consistent segmentation. Specifically, in the point-based branch, we leverage historical knowledge to enrich the current features through temporal fusion on multiple views. In the cluster-based branch, we propose a new strategy to produce cluster labels of foreground objects and apply them to gather point-wise information to derive cluster features. We then merge neighboring clusters across multiple scans to restore features missing due to occlusion. Finally, in the point-cluster fusion stage, we adaptively fuse the information from the two branches to optimize segmentation results. Extensive experiments confirm the effectiveness of the proposed method, and we achieve state-of-the-art results on multi-scan semantic segmentation and moving object segmentation on the SemanticKITTI and nuScenes datasets. The code will be available at https://github.com/NEU-REAL/4D-CS.git.
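
To make the cluster-based branch more concrete, the sketch below shows one way that cluster labels could be used to gather point-wise information into cluster features and hand the result back to the points. It is an illustration only, not the authors' implementation: the function name `pool_cluster_features`, the use of mean pooling, the `-1` background label, and the zero vector for background points are all assumptions, since the abstract does not specify the pooling operator.

```python
def pool_cluster_features(point_feats, cluster_ids, background_id=-1):
    """Mean-pool point features per cluster and scatter them back to points.

    point_feats: list of equal-length float lists, one per point.
    cluster_ids: list of ints, one per point. Points labelled
        background_id (an assumed convention) belong to no cluster.
    Returns (cluster_feats, per_point), where cluster_feats maps a
    cluster id to its pooled feature and per_point holds, for every
    point, the feature of its cluster (zeros for background points).
    """
    sums, counts = {}, {}
    for feat, cid in zip(point_feats, cluster_ids):
        if cid == background_id:
            continue
        acc = sums.setdefault(cid, [0.0] * len(feat))
        for k, value in enumerate(feat):
            acc[k] += value
        counts[cid] = counts.get(cid, 0) + 1

    # Mean pooling is an assumption; max pooling would be an equally
    # plausible choice for aggregating points into a cluster feature.
    cluster_feats = {
        cid: [s / counts[cid] for s in acc] for cid, acc in sums.items()
    }

    # Assign each cluster feature to the points that belong to it.
    per_point = [
        list(cluster_feats[cid]) if cid != background_id else [0.0] * len(feat)
        for feat, cid in zip(point_feats, cluster_ids)
    ]
    return cluster_feats, per_point


if __name__ == "__main__":
    feats = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]]
    ids = [0, 0, -1]  # two points in cluster 0, one background point
    print(pool_cluster_features(feats, ids))
```

Because every point in a cluster receives the same pooled feature, predictions derived from it tend to agree within an object, which is the consistency property the abstract emphasizes.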

Paper Structure (19 sections, 6 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Comparison of the baseline (WaffleIron) with our proposed method on SemanticKITTI. For both methods, subfigures (a) and (c) display semantic segmentation, while subfigures (b) and (d) illustrate moving object segmentation. Subfigures (e) and (f) present clustering results of foreground objects derived from DBSCAN.
  • Figure 2: The framework of our 4D-CS. (a) In the point-based branch, we extract point-wise features and enhance them using historical knowledge through the MTF module. (b) In the cluster-based branch, cluster labels are first used as additional input to generate initial cluster features. The TCE module then integrates adjacent cluster features across multiple frames to enrich instance information, which is subsequently assigned to the corresponding points. (c) Finally, the segmentation results from the two branches are fused adaptively in Point-cluster Fusion.
  • Figure 3: In the MTF module shown in (a), we sequentially fuse the current and historical features on the $x$-$y$, $y$-$z$, and $x$-$z$ planes using the 2D Fusion module illustrated in (b) to integrate the 3D spatial features efficiently.
  • Figure 4: Illustration of cluster label generation. We first use voxels to transfer historical semantic predictions to the current points, and then apply DBSCAN to generate the clusters of foreground objects.
  • Figure 5: Illustration of the Adaptive Prediction Fusion (APF) module. We adopt separate heads to estimate logits for the point features from each branch, while combining the two features to compute confidence scores. Then, we perform a weighted sum of the logits to generate the final prediction.
  • ...and 2 more figures
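
The weighted sum of logits described in the Figure 5 caption can be sketched for a single point as follows. This is a minimal illustration under stated assumptions, not the APF module itself: the function name `fuse_logits` is hypothetical, the confidence scores are taken as given rather than computed from the combined features, and normalizing them with a softmax so that the two weights sum to one is an assumption the caption does not confirm.

```python
import math


def fuse_logits(point_logits, cluster_logits, scores):
    """Fuse per-class logits from two branches for one point.

    point_logits, cluster_logits: lists of per-class logits.
    scores: (s_point, s_cluster), raw confidence scores for the two
        branches. Here they are inputs; in the paper they are computed
        from the combined features of both branches.
    """
    # Softmax over the two scores (assumed normalization), shifted by
    # the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    w_point, w_cluster = (e / total for e in exps)

    # Weighted sum of the two branches' logits, class by class.
    return [
        w_point * p + w_cluster * c
        for p, c in zip(point_logits, cluster_logits)
    ]


if __name__ == "__main__":
    fused = fuse_logits([2.0, 0.0], [0.0, 4.0], (0.0, 0.0))
    predicted_class = max(range(len(fused)), key=fused.__getitem__)
    print(fused, predicted_class)
```

With equal scores the two branches contribute equally; as one score grows relative to the other, the fused logits approach that branch's logits, so the final prediction can follow whichever branch is more confident at each point.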