
SpotNet: An Image Centric, Lidar Anchored Approach To Long Range Perception

Louis Foucard, Samar Khanna, Yi Shi, Chi-Kuei Liu, Quinn Z Shen, Thuyen Ngo, Zi-Xiang Xia

TL;DR

SpotNet tackles long-range 3D perception for heavy vehicles by fusing high-resolution camera imagery with LiDAR-derived range anchoring in a single-stage, range-view RGB-D framework. Detections are anchored to LiDAR points: LiDAR returns are projected into a reduced-resolution image to form a sparse depth raster that is fused at multiple network stages, and 2D and 3D predictions are jointly supervised in image space using a Laplacian-based likelihood. Experiments on the Aurora Long Range Dataset show SpotNet surpassing LiDAR-centric BEV and image-centric baselines, with notable gains when moving from 2MP to 8MP imagery and a train-on-2MP, test-on-8MP strategy that preserves depth density. Inference cost does not grow with operating range, and accuracy is maintained at 100–500 m, highlighting the practicality of LiDAR-anchored, image-rich long-range perception for autonomous trucking.
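
The projection of LiDAR returns into a reduced-resolution depth raster can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `rasterize_depth`, the pinhole intrinsics `K`, the `stride` parameter, and the "nearest return wins" rule for cells hit by several points are all assumptions made for the example.

```python
import numpy as np

def rasterize_depth(points_cam, K, image_hw, stride=4):
    """Project 3D points (camera frame, z forward) into a sparse depth raster.

    points_cam: (N, 3) array of x, y, z in metres (hypothetical input layout).
    K: (3, 3) pinhole intrinsics of the full-resolution image (assumed model).
    image_hw: (height, width) of the full-resolution image in pixels.
    stride: assumed downsampling factor of the raster relative to the image.
    Returns an (H // stride, W // stride) float32 array; 0 marks "no return".
    """
    h, w = image_hw[0] // stride, image_hw[1] // stride
    depth = np.full((h, w), np.inf, dtype=np.float32)

    # Keep only points in front of the camera.
    pts = points_cam[points_cam[:, 2] > 0.0]

    # Pinhole projection to full-resolution pixels, then to raster cells.
    uvw = pts @ K.T
    u = np.floor(uvw[:, 0] / uvw[:, 2] / stride).astype(int)
    v = np.floor(uvw[:, 1] / uvw[:, 2] / stride).astype(int)
    z = pts[:, 2].astype(np.float32)

    # Discard points that fall outside the raster.
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z = u[inside], v[inside], z[inside]

    # If several points land in one cell, keep the nearest one.
    np.minimum.at(depth, (v, u), z)

    # Cells that received no point are marked with 0.
    depth[np.isinf(depth)] = 0.0
    return depth
```

Because the raster is indexed in image space, its size depends only on the image resolution and the chosen stride, not on how far away the points are.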

Abstract

In this paper, we propose SpotNet: a fast, single-stage, image-centric but LiDAR-anchored approach for long-range 3D object detection. We demonstrate that our approach to LiDAR/image sensor fusion, combined with the joint learning of 2D and 3D detection tasks, can lead to accurate 3D object detection with very sparse LiDAR support. Unlike more recent bird's-eye-view (BEV) sensor-fusion methods, which scale with range $r$ as $O(r^2)$, SpotNet scales as $O(1)$ with range. We argue that such an architecture is ideally suited to leverage each sensor's strength, i.e. semantic understanding from images and accurate range finding from LiDAR data. Finally, we show that anchoring detections on LiDAR points removes the need to regress distances, and so the architecture is able to transfer from 2MP to 8MP resolution images without re-training.
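
The scaling argument can be made concrete with a small back-of-the-envelope sketch. The 0.5 m BEV cell size, the 1080x1920 image, and the stride of 4 are illustrative assumptions, not values taken from the paper; the point is only that a square BEV grid grows quadratically with range while a range-view feature map does not change.

```python
def bev_cells(max_range_m, cell_size_m=0.5):
    """Cells in a square BEV grid covering [-r, r] x [-r, r] (assumed layout)."""
    side = int(2 * max_range_m / cell_size_m)
    return side * side

def range_view_cells(image_hw=(1080, 1920), stride=4):
    """Cells in a range-view feature map; independent of sensing range."""
    return (image_hw[0] // stride) * (image_hw[1] // stride)

if __name__ == "__main__":
    for r in (100, 200, 400):
        # BEV cell count quadruples each time the range doubles;
        # the range-view count stays the same.
        print(r, bev_cells(r), range_view_cells())
```

Under these assumptions, doubling the range quadruples the number of BEV cells, while the range-view map keeps the same number of cells at every range.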

Paper Structure

This paper contains 14 sections, 7 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overall architecture diagram. The figure shows the input RGB-D data as well as the output maps for each of the category, 2D regression, and 3D regression heads. The lidar data is rasterized and used in both early and late sensor fusion.
  • Figure 2: 2D and 3D target encoding. On the left, the 2D labels and projected 3D labels are shown in the image frame in pixel space, along with projected foreground (red) lidar points and background (black) lidar points. On the right, the 3D label is shown in a top-down view in the 3D camera frame, along with foreground lidar points.
  • Figure 3: Range-based models (SpotNet and CenterNet) have fixed inference times regardless of operating range, whereas inference times of BEV methods increase with the size of the BEV feature map. The data was collected on an NVIDIA A10 GPU, with all models executed in 16-bit floating point precision. Best viewed in color.
  • Figure 4: Example detections on our validation dataset, out to ranges of 450 m, anchored on FMCW lidar points. The furthest detections have only 1-2 lidar points. Detection bounding boxes are color-coded by class (vehicle, pedestrian, construction); all linked 2D/3D labels are displayed in white.