TFNet: Exploiting Temporal Cues for Fast and Accurate LiDAR Semantic Segmentation

Rong Li, ShiJie Li, Xieyuanli Chen, Teli Ma, Juergen Gall, Junwei Liang

TL;DR

TFNet tackles the many-to-one boundary problem in range-image LiDAR semantic segmentation by introducing a Temporal Cross-Attention (TCA) module to fuse features from previous scans and a Max-Voting Post-Processing (MVP) step to refine predictions during inference. The approach projects LiDAR frames to range images, extracts multi-scale features, and uses TCA to integrate temporal context, while MVP aligns past predictions in a common frame and performs voxel-wise max-voting. Experiments on SemanticKITTI and SemanticPOSS show that TFNet achieves state-of-the-art performance among range-image methods, with MVP providing consistent gains across backbones and maintaining real-time inference. This work demonstrates that temporal coherence effectively resolves occlusions and projection ambiguities, offering a practical, plug-in improvement for LiDAR-based semantic segmentation in autonomous driving settings.
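The max-voting step described above can be illustrated with a minimal sketch. It assumes per-point labels for each scan and known sensor-to-world poses, and it uses the world frame as the common frame. The function name `max_voting`, the `voxel_size` default, and the tie-breaking rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def max_voting(points_list, labels_list, poses, num_classes, voxel_size=0.2):
    """Refine the labels of the newest scan by voxel-wise majority vote.

    points_list[i]: (N_i, 3) points in the sensor frame of scan i
    labels_list[i]: (N_i,) integer class predictions for scan i
    poses[i]:       (4, 4) sensor-to-world transform of scan i (assumed known)
    The last entry of each list is the current scan.
    """
    # 1. Move every scan into a common (world) frame.
    world = []
    for pts, T in zip(points_list, poses):
        homo = np.hstack([pts, np.ones((len(pts), 1))])
        world.append((homo @ T.T)[:, :3])
    all_pts = np.vstack(world)
    all_lbl = np.concatenate(labels_list)

    # 2. Assign each point to a voxel of side length voxel_size.
    vox = np.floor(all_pts / voxel_size).astype(np.int64)
    _, vox_id = np.unique(vox, axis=0, return_inverse=True)
    vox_id = vox_id.reshape(-1)

    # 3. Count the votes per voxel and class, then pick the most frequent class.
    votes = np.zeros((vox_id.max() + 1, num_classes), dtype=np.int64)
    np.add.at(votes, (vox_id, all_lbl), 1)
    winner = votes.argmax(axis=1)  # ties go to the lowest class index

    # 4. Read back the winning label for the points of the current scan.
    n_cur = len(points_list[-1])
    return winner[vox_id[-n_cur:]]
```

In this sketch the number of scans passed in plays the role of the temporal window, and `voxel_size` plays the role of the grid resolution. A point that shares its voxel with no past prediction simply keeps its own label.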

Abstract

LiDAR semantic segmentation plays a crucial role in enabling autonomous vehicles and robots to understand their surroundings accurately and robustly. A multitude of methods exist within this domain, including point-based, range-image-based, polar-coordinate-based, and hybrid strategies. Among these, range-image-based techniques have gained widespread adoption in practical applications due to their efficiency. However, they face a significant challenge known as the "many-to-one" problem caused by the range image's limited horizontal and vertical angular resolution. As a result, around 20% of the 3D points can be occluded. In this paper, we present TFNet, a range-image-based LiDAR semantic segmentation method that utilizes temporal information to address this issue. Specifically, we incorporate a temporal fusion layer to extract useful information from previous scans and integrate it with the current scan. We then design a max-voting-based post-processing technique to correct false predictions, particularly those caused by the "many-to-one" issue. We evaluate the approach on two benchmarks and demonstrate that the plug-in post-processing technique is generic and can be applied to various networks.
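To make the "many-to-one" problem concrete, the following sketch projects a point cloud onto a range image with a standard spherical projection and measures the fraction of points that share a pixel with another point. The image size (64 x 2048) and vertical field of view (+3° to -25°) are assumed values typical of a 64-beam sensor; they are not taken from the text above, and the function names are hypothetical.

```python
import numpy as np

def range_projection(points, H=64, W=2048, fov_up_deg=3.0, fov_down_deg=-25.0):
    """Map (N, 3) points to integer (row, col) pixels of an H x W range image."""
    fov_up = np.radians(fov_up_deg)
    fov_down = np.radians(fov_down_deg)
    fov = fov_up - fov_down

    depth = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(points[:, 1], points[:, 0])
    pitch = np.arcsin(points[:, 2] / np.maximum(depth, 1e-8))

    col = 0.5 * (1.0 - yaw / np.pi) * W          # horizontal angle -> column
    row = (1.0 - (pitch - fov_down) / fov) * H   # vertical angle -> row

    col = np.clip(np.floor(col), 0, W - 1).astype(np.int64)
    row = np.clip(np.floor(row), 0, H - 1).astype(np.int64)
    return row, col, depth

def occluded_fraction(points, H=64, W=2048):
    """Fraction of points that lose their pixel to another point."""
    row, col, _ = range_projection(points, H=H, W=W)
    pixel = row * W + col
    # Each occupied pixel stores exactly one point; the rest are hidden.
    n_visible = np.unique(pixel).size
    return 1.0 - n_visible / len(points)
```

Two points along the same ray, such as `[1, 0, 0]` and `[2, 0, 0]`, land in the same pixel, so only one of them can keep its own prediction after re-projection to 3D.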

Paper Structure (13 sections, 3 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Range-image-based methods suffer from the "many-to-one" problem where multiple 3D points with the same angle are mapped to a single range pixel. Marked by the red circles of frame $t_0$, this can cause distant terrain points (purple) to receive erroneous predictions from nearby billboard points (blue) when the range image is re-projected to 3D. Furthermore, occluded points in frame $t_0$ become visible in $t_1$, offering an opportunity to refine the predictions.
  • Figure 2: Architecture of TFNet. For a point cloud $P_t$, TFNet projects it onto range images $I_t$. It then uses a segmentation backbone to extract multi-scale features $\{F_t\}_{1:l}$, a Temporal Cross-Attention (TCA) layer to integrate past features $\{F_{t-1}\}_{1:l}$, and a segmentation head to predict range-image-based logits $O_t$. During inference, it refines the re-projected prediction $S_t$ by aggregating the current and past predictions $\{S\}_{1:t}$ with a Max-Voting-based Post-processing (MVP) strategy.
  • Figure 3: Illustration of the max voting post-processing strategy.
  • Figure 4: Effect of window size and grid size resolution.
  • Figure 5: Qualitative analysis of the post-processing scheme. (a) The "many-to-one" issue is evident without post-processing, e.g., the trunk is partially segmented as traffic sign and vegetation because they project onto the same range pixel (row 2). (b) k-NN post-processing (RangeNet++) smooths the semantic labels locally, but it cannot resolve ambiguities caused by nearby objects or by prediction errors. (c) Our method exploits temporal information to resolve false predictions (row 1) or ambiguities due to occlusions (row 2). Best viewed in color.
  • ...and 1 more figure
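The Figure 2 caption describes a cross-attention layer that integrates past features into the current ones. As a rough illustration of that idea, the sketch below applies generic scaled dot-product cross-attention at a single feature scale, with queries from the current scan and keys and values from the previous scan. The residual connection, the global attention over all pixels, and the names `temporal_cross_attention`, `w_q`, `w_k`, and `w_v` are assumptions made for this example; the actual TCA layer may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_cross_attention(f_cur, f_prev, w_q, w_k, w_v):
    """Fuse a past feature map into the current one.

    f_cur, f_prev: (C, H, W) feature maps at one scale
    w_q, w_k, w_v: (C, C) projection matrices (hypothetical learned weights)
    Returns a fused (C, H, W) feature map.
    """
    C, H, W = f_cur.shape
    cur = f_cur.reshape(C, -1).T     # (H*W, C), one token per pixel
    prev = f_prev.reshape(C, -1).T   # (H*W, C)

    q = cur @ w_q                    # queries come from the current scan
    k = prev @ w_k                   # keys and values come from the past scan
    v = prev @ w_v

    attn = softmax(q @ k.T / np.sqrt(C), axis=-1)   # (H*W, H*W)
    fused = cur + attn @ v                          # residual connection (assumed)
    return fused.T.reshape(C, H, W)
```

Global attention over all pixels costs memory quadratic in `H * W`, so this form is only practical on small or heavily downsampled feature maps; a real-time system would likely restrict the attention window.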