HiLoTs: High-Low Temporal Sensitive Representation Learning for Semi-Supervised LiDAR Segmentation in Autonomous Driving

R. D. Lin, Pengcheng Weng, Yinqiao Wang, Han Ding, Jinsong Han, Fei Wang

TL;DR

HiLoTs addresses semi-supervised LiDAR segmentation by exploiting long-term temporal dynamics through two branches: a High Temporal Sensitivity Flow (HTSF) for distant regions and a Low Temporal Sensitivity Flow (LTSF) for nearby regions, fused with cross-attention. The method uses cylindrical voxelization and multi-voxel aggregation to enable efficient Transformer-style embedding within a Mean Teacher semi-supervised learning (SSL) framework. It achieves state-of-the-art results on SemanticKITTI and nuScenes and approaches LiDAR+Camera multimodal performance without camera data. Ablation studies confirm the benefits of HTSF/LTSF and cross-attention, and robustness analyses show competitive behavior under adverse conditions, highlighting practical impact for autonomous driving systems.

Abstract

LiDAR point cloud semantic segmentation plays a crucial role in autonomous driving. In recent years, semi-supervised methods have gained popularity because they significantly reduce annotation labor and time costs. Current semi-supervised methods typically focus on the spatial distribution of point clouds or consider only short-term temporal representations, e.g., two adjacent frames, and often overlook the rich long-term temporal properties inherent in autonomous driving scenarios. From driving experience, we observe that nearby objects, such as roads and vehicles, remain stable while driving, whereas distant objects exhibit greater variability in category and shape. This natural phenomenon is also captured by LiDAR, which reflects lower temporal sensitivity for nearby objects and higher temporal sensitivity for distant ones. To leverage these characteristics, we propose HiLoTs, which learns high temporal sensitivity and low temporal sensitivity representations from continuous LiDAR frames. These representations are further enhanced and fused using a cross-attention mechanism. Additionally, we employ a teacher-student framework to align the representations learned by the labeled and unlabeled branches, making effective use of the large amounts of unlabeled data. Experimental results on the SemanticKITTI and nuScenes datasets demonstrate that HiLoTs outperforms state-of-the-art semi-supervised methods and achieves performance close to LiDAR+Camera multimodal approaches. Code is available at https://github.com/rdlin118/HiLoTs
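The teacher-student alignment mentioned in the abstract can be illustrated with a generic Mean Teacher step. This is only a minimal sketch under stated assumptions, not the authors' implementation: weights are represented as flat lists of floats, the smoothing factor `alpha` and the helper names `ema_update` and `consistency_loss` are hypothetical, and mean squared error stands in for whatever consistency term the paper actually uses.

```python
def ema_update(teacher, student, alpha=0.99):
    """Exponential moving average of weights (generic Mean Teacher rule).

    teacher, student: equally sized lists of floats (hypothetical flat weights).
    alpha: assumed smoothing factor; the paper's value is not given here.
    """
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


def consistency_loss(student_probs, teacher_probs):
    """Mean squared error between student and teacher class probabilities.

    MSE is an illustrative choice, not necessarily the loss used in HiLoTs.
    """
    n = len(student_probs)
    return sum((s - t) ** 2 for s, t in zip(student_probs, teacher_probs)) / n


if __name__ == "__main__":
    teacher = [1.0, 0.0]
    student = [0.0, 1.0]
    # The teacher drifts slowly toward the student instead of being trained directly.
    teacher = ema_update(teacher, student, alpha=0.9)
    print(teacher)
    # On unlabeled frames, the student is pushed toward the teacher's prediction.
    print(consistency_loss([0.7, 0.3], [0.6, 0.4]))
```

In this scheme only the student receives gradients; the teacher is a smoothed copy whose predictions on unlabeled frames act as soft targets.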

Paper Structure

This paper contains 18 sections, 9 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Different semantic classes exhibit varying degrees of sensitivity to temporal changes. Objects farther from the vehicle (e.g., vegetation, building, person, etc.) change more frequently over time, as indicated by the red box. In contrast, objects closer to the vehicle (e.g., road, sidewalk, etc.) are less sensitive to temporal changes, as shown by the blue box.
  • Figure 2: Our segmentation model involves three stages. First, cylindrical voxelization transforms unordered points into volumetric grids, followed by a spatial feature extraction backbone. Then, HiLoTs processes the labeled and unlabeled cylindrical features through a student-teacher framework and integrates the attention map from the HiLoTs Embedding Unit (HEU) to produce voxel-level segmentation maps. Finally, a point-wise refinement network produces point-level segmentation results.
  • Figure 3: HiLoTs Embedding Unit (HEU). Distant voxel features are passed into the high temporal sensitivity flow, while voxel features in closer areas undergo the low temporal sensitivity flow. The output map of the HEU is fused with the bottleneck feature map from the segmentation model and then passed into the decoder.
  • Figure 4: Error map visualization (blue points mark correct predictions, red points mark incorrect ones). The left three columns are segmentation results from the SemanticKITTI dataset, while the right three columns are from nuScenes. Our HiLoTs method shows a significant improvement in regions containing distant objects.
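The cylindrical voxelization in Figure 2 and the near/far routing in Figure 3 can be sketched in a few lines. This is an illustrative sketch only: the grid resolution, coordinate ranges, and the 20 m `radius_threshold` are assumptions, and the helper names `cylindrical_voxel_index` and `split_by_distance` are hypothetical. The captions also describe routing voxel features, whereas this sketch splits raw points for simplicity.

```python
import math


def cylindrical_voxel_index(x, y, z, rho_max=50.0, z_min=-4.0, z_max=2.0,
                            grid=(480, 360, 32)):
    """Map a Cartesian point to (radius, angle, height) voxel indices.

    All ranges and the grid shape are assumed values for illustration.
    """
    rho = math.hypot(x, y)      # horizontal distance from the sensor
    phi = math.atan2(y, x)      # azimuth in [-pi, pi]

    def to_bin(value, lo, hi, n):
        # Normalize to [0, 1), scale to n bins, clamp out-of-range values.
        frac = (value - lo) / (hi - lo)
        return min(n - 1, max(0, int(frac * n)))

    return (to_bin(rho, 0.0, rho_max, grid[0]),
            to_bin(phi, -math.pi, math.pi, grid[1]),
            to_bin(z, z_min, z_max, grid[2]))


def split_by_distance(points, radius_threshold=20.0):
    """Split points into (near, far) by horizontal distance.

    Near points would feed a low temporal sensitivity branch and far points
    a high temporal sensitivity branch; the threshold is hypothetical.
    """
    near, far = [], []
    for p in points:
        target = far if math.hypot(p[0], p[1]) > radius_threshold else near
        target.append(p)
    return near, far


if __name__ == "__main__":
    pts = [(1.0, 0.0, -1.0), (30.0, 5.0, 0.5)]
    near, far = split_by_distance(pts)
    print(len(near), len(far))
    print(cylindrical_voxel_index(*pts[0]))
```

Cylindrical bins grow in physical size with radius, which suits the lower point density of distant LiDAR returns better than a uniform Cartesian grid.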