Image-Guided Semantic Pseudo-LiDAR Point Generation for 3D Object Detection

Minseung Lee, Seokha Moon, Seung Joon Lee, Reza Mahjourian, Jinkyu Kim

TL;DR

LiDAR sparsity hampers reliable 3D object detection, especially for small or distant objects. The authors propose ImagePG, a framework that generates dense, semantically meaningful pseudo-LiDAR points by fusing RGB image semantics with LiDAR data through three components: Image-Guided RoI Points Generation (IG-RPG), an Image-Aware Occupancy Prediction Network (I-OPN), and multi-stage refinement (MR), guided by deformable attention and BEV priors. The approach improves detection on KITTI and Waymo, reducing false positives by nearly 50% and achieving state-of-the-art cyclist detection on the KITTI leaderboard, while remaining compatible with multiple backbones. Overall, ImagePG performs consistently across both datasets and offers a practical image-guided enhancement for multi-modal 3D perception in autonomous driving.

Abstract

In autonomous driving scenarios, accurate perception is becoming an even more critical task for safe navigation. While LiDAR provides precise spatial data, its inherent sparsity makes it difficult to detect small or distant objects. Existing methods try to address this by generating additional points within a Region of Interest (RoI), but relying on LiDAR alone often leads to false positives and a failure to recover meaningful structures. To address these limitations, we propose the Image-Guided Semantic Pseudo-LiDAR Point Generation model, called ImagePG, a novel framework that leverages rich RGB image features to generate dense and semantically meaningful 3D points. Our framework includes an Image-Guided RoI Points Generation (IG-RPG) module, which creates pseudo-points guided by image features, and an Image-Aware Occupancy Prediction Network (I-OPN), which provides spatial priors to guide point placement. A multi-stage refinement (MR) module further enhances point quality and detection robustness. To the best of our knowledge, ImagePG is the first method to directly leverage image features for point generation. Extensive experiments on the KITTI and Waymo datasets demonstrate that ImagePG significantly improves the detection of small and distant objects such as pedestrians and cyclists, reducing false positives by nearly 50%. On the KITTI benchmark, our framework improves mAP by +1.38%p (car), +7.91%p (pedestrian), and +5.21%p (cyclist) on the test set over the baseline, achieving state-of-the-art cyclist performance on the KITTI leaderboard. The code is available at: https://github.com/MS-LIMA/ImagePG

Paper Structure (19 sections, 5 equations, 11 figures, 15 tables)


Figures (11)

  • Figure 1: (a) Conventional LiDAR-only methods for generating point clouds. (b) Our image-guided point cloud generation approach leverages visual semantic information (from RGB images) to enhance the density of point clouds, thereby improving overall 3D perception performance.
  • Figure 2: An overview of our proposed ImagePG architecture. (i) The input point cloud undergoes multiple geometric transformations, and corresponding features are extracted alongside image features. (ii) Initial region proposals are generated by RPN in conjunction with I-OPN. (iii) These proposals are fed into the IG-RPG module to generate semantically enriched points, which are then used by the detection head to predict bounding boxes. The predicted boxes are passed to the next refinement stage. (iv) Final bounding boxes are obtained through a box voting mechanism that aggregates multi-stage predictions.
  • Figure 3: Illustration of our proposed IG-RPG and the detection head. (i) Each grid point is projected onto the image plane, where deformable attention is applied to sample corresponding image features. These sampled features are fused with the voxel features $F_{g_i}^{vox}$, enabling the generation of semantically guided points. (ii) The generated points are subsequently encoded using a point encoder and passed to the detection head, which predicts 3D bounding boxes and classification scores.
  • Figure 4: Illustration of our proposed I-OPN. (i) Raw point clouds are encoded with a pillar-based backbone (PointPillars) to obtain pillar features $F_i^{\text{pill}}$. Each pillar center is projected onto the image plane, deformable attention enriches $F_i^{\text{pill}}$ with sampled image cues to yield $F_i^{\text{spill}}$, and an occupancy heatmap is predicted. (ii) From the multi-transformed LiDAR inputs, a shared 3D backbone extracts BEV features. (iii) The occupancy heatmap is concatenated with the BEV features, and the combined representation is fed into the RPN.
  • Figure 5: Qualitative results of semantic point generation and detection for the baseline (PG-RCNN) and ours. The baseline often generates incorrect points, whereas our approach suppresses such errors, leading to improved detection performance.
  • ...and 6 more figures
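
The projection-and-sampling step described in the Figure 3 and Figure 4 captions can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a single 3x4 projection matrix that maps 3D points directly to pixels, and it swaps the paper's deformable attention for plain bilinear sampling at the projected location, followed by simple concatenation with the voxel features. The function names (`project_points`, `bilinear_sample`, `fuse_grid_features`) and all array shapes are hypothetical and chosen only for illustration.

```python
import numpy as np


def project_points(points_xyz, proj):
    """Project (N, 3) points to pixels with a 3x4 matrix (assumed given).

    Returns (N, 2) pixel coordinates (u, v) and an (N,) depth array.
    """
    ones = np.ones((len(points_xyz), 1))
    homo = np.concatenate([points_xyz, ones], axis=1)   # (N, 4)
    cam = homo @ proj.T                                  # (N, 3): [u*z, v*z, z]
    depth = cam[:, 2]
    # Avoid division by zero for points on the camera plane.
    safe = np.where(np.abs(depth) > 1e-6, depth, 1e-6)
    return cam[:, :2] / safe[:, None], depth


def bilinear_sample(feat, uv):
    """Bilinearly sample an (H, W, C) feature map at (N, 2) pixel coords.

    Coordinates are clamped to the image bounds before sampling.
    """
    h, w, _ = feat.shape
    u = np.clip(uv[:, 0], 0.0, w - 1.0)
    v = np.clip(uv[:, 1], 0.0, h - 1.0)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, w - 1)
    v1 = np.minimum(v0 + 1, h - 1)
    du = (u - u0)[:, None]
    dv = (v - v0)[:, None]
    top = feat[v0, u0] * (1 - du) + feat[v0, u1] * du
    bottom = feat[v1, u0] * (1 - du) + feat[v1, u1] * du
    return top * (1 - dv) + bottom * dv


def fuse_grid_features(grid_xyz, voxel_feat, image_feat, proj):
    """Concatenate per-grid-point voxel features with sampled image features.

    Points behind the camera or outside the image get zero image features.
    Simplification: the paper uses deformable attention here, not
    single-location bilinear sampling plus concatenation.
    """
    uv, depth = project_points(grid_xyz, proj)
    h, w, _ = image_feat.shape
    visible = (
        (depth > 0)
        & (uv[:, 0] >= 0) & (uv[:, 0] <= w - 1)
        & (uv[:, 1] >= 0) & (uv[:, 1] <= h - 1)
    )
    sampled = bilinear_sample(image_feat, uv) * visible[:, None]
    return np.concatenate([voxel_feat, sampled], axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical pinhole projection: focal length 100 px, center (32, 24).
    proj = np.array([[100.0, 0.0, 32.0, 0.0],
                     [0.0, 100.0, 24.0, 0.0],
                     [0.0, 0.0, 1.0, 0.0]])
    image_feat = rng.normal(size=(48, 64, 8))    # (H, W, C) image features
    grid = np.array([[0.0, 0.0, 10.0],           # projects to the image center
                     [1.0, -0.5, 20.0],          # projects between two rows
                     [0.0, 0.0, -5.0]])          # behind the camera
    voxel_feat = np.ones((3, 4))                 # stand-in voxel features
    fused = fuse_grid_features(grid, voxel_feat, image_feat, proj)
    print(fused.shape)
```

In a learned model, the fixed bilinear weights would be replaced by attention weights and sampling offsets predicted from the query feature, which is what lets deformable attention gather image cues from several locations around each projected point.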