SeSame: Simple, Easy 3D Object Detection with Point-Wise Semantics

Hayeon O, Chanuk Yang, Kunsoo Huh

TL;DR

This work tackles the limited semantic context in LiDAR-only 3D object detectors by injecting per-point semantics extracted from LiDAR semantic segmentation into existing detectors, without requiring camera–LiDAR calibration. The SeSame pipeline uses Cylinder3D to generate per-point labels, maps them to KITTI classes, and concatenates semantic features with raw point coordinates to augment PointRCNN, SECOND, and PointPillar-based detectors. Evaluations on KITTI show consistent improvements over baselines and several multimodal methods across BEV and 3D detection metrics, particularly for car detections. Ablation analyses reveal that one-hot label encodings are more robust than soft scores, and analysis indicates images excel for sparse objects while LiDAR semantics better handle occlusion for larger objects, highlighting complementary strengths. The approach demonstrates the viability of calibration-free semantic augmentation for LiDAR-only detection, though it relies on semantic annotations; future work targets self-supervised multimodal semantic segmentation as a pretext task to reduce labeling costs.
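The augmentation step described above (per-point labels, class mapping, concatenation with the raw points) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The label ids in `SEG_TO_DET`, the four-class layout, the `(x, y, z, intensity)` point format, and the function names are all hypothetical and chosen only for the example.

```python
import numpy as np

# Hypothetical mapping from segmentation label ids to detector classes.
# Assumed detector classes: 0 = background, 1 = car, 2 = pedestrian, 3 = cyclist.
# The ids on the left are placeholders, not the mapping used in the paper.
SEG_TO_DET = {1: 1, 6: 2, 7: 3}
NUM_DET_CLASSES = 4


def map_labels(seg_labels, mapping=SEG_TO_DET):
    """Map per-point segmentation ids to detector class ids via a lookup table."""
    lut = np.zeros(int(seg_labels.max()) + 1, dtype=np.int64)  # default: background
    for src, dst in mapping.items():
        if src < lut.size:
            lut[src] = dst
    return lut[seg_labels]


def augment_points(points, seg_labels):
    """Append a one-hot semantic vector to every point.

    points:     (N, 4) array of x, y, z, intensity (assumed layout)
    seg_labels: (N,) integer labels from a segmentation network
    returns:    (N, 4 + NUM_DET_CLASSES) array
    """
    det_labels = map_labels(seg_labels)
    one_hot = np.eye(NUM_DET_CLASSES, dtype=points.dtype)[det_labels]
    return np.concatenate([points, one_hot], axis=1)


points = np.array([[1.0, 2.0, 0.5, 0.3],
                   [4.0, -1.0, 0.2, 0.8],
                   [7.5, 3.0, 1.1, 0.1]], dtype=np.float32)
seg_labels = np.array([1, 6, 9])  # the third id is unmapped, so it becomes background
augmented = augment_points(points, seg_labels)
print(augmented.shape)
```

A detector consuming `augmented` would need its input feature dimension widened from 4 to 8 in this sketch. Replacing `one_hot` with per-class softmax scores would give the soft-score variant that the ablation compares against.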

Abstract

In autonomous driving, 3D object detection provides more precise information for downstream tasks, including path planning and motion estimation, compared to 2D object detection. In this paper, we propose SeSame: a method aimed at enhancing semantic information in existing LiDAR-only based 3D object detection. This addresses the limitation of existing 3D detectors, which primarily focus on object presence and classification, thus lacking in capturing relationships between elemental units that constitute the data, akin to semantic segmentation. Experiments demonstrate the effectiveness of our method with performance improvements on the KITTI object detection benchmark. Our code is available at https://github.com/HAMA-DL-dev/SeSame

Paper Structure

This paper contains 15 sections, 1 equation, 5 figures, 5 tables, and 1 algorithm.

Figures (5)

  • Figure 1: (left) A scenario in which two objects, a pedestrian and a car, overlap, causing occlusion. (center) When pixel-wise segmentation (DeepLabv3) is projected onto the point cloud using perspective projection, mis-segmentation occurs: some of the semantic features of the pedestrian (blue) are classified as car (green). (right) Point-wise semantic segmentation, by contrast, is more accurate.
  • Figure 2: (left) Due to the error-sensitive hard association based on the LiDAR-camera calibration matrix, some points were not projected onto reflective surfaces. (right) Additionally, many image features did not correspond to the point cloud. According to BEVFusion, less than 5% of image features match with point clouds (for a 32-channel LiDAR scanner).
  • Figure 3: (top) Overall architecture of the LiDAR semantic segmentation network (Cylinder3D) implemented in this paper. It returns per-point semantics. (bottom left) The input point clouds are augmented with the semantics. (bottom right) 3D detectors with various feature extractors: point, voxel, and pillar. The OpenPCDet framework supports various models within a single framework, with the green sections indicating the components used by each model.
  • Figure 4: The leftmost of the four sections represents the ground truth (GT), while the remaining three sections depict predictions from detectors based on point (PointRCNN), voxel (SECOND), and pillar (PointPillar) input features. The top figure illustrates a scenario with multiple cars (blue) and some cyclists (red), while the bottom figure shows multiple pedestrians (green).
  • Figure 5: Qualitative results on the KITTI test set. There are two scenes. For each scene, the results of SeSame+point, +voxel, and +pillar are shown from leftmost to rightmost.