Learning better representations for crowded pedestrians in offboard LiDAR-camera 3D tracking-by-detection

Shichao Li, Peiliang Li, Qing Lian, Peng Yun, Xiaozhi Chen

TL;DR

The paper tackles the perception and tracking of crowded pedestrians for autonomous driving by proposing an offboard 3D multiple-object-tracking (MOT) framework and a dedicated multi-view benchmark, PCP-MV. It introduces three techniques: density-aware weighting to focus learning on crowded regions, relationship-aware targets to discriminate adjacent pedestrians under sparse LiDAR returns, and high-resolution sparse representations to better detect small objects. Built on an offboard BEVFusion-based tracking-by-detection backbone, these methods raise MOTA on PCP-MV from a 0.172 baseline to as much as 0.353, roughly a twofold gain, and demonstrate strong generalization on nuScenes. The work also delivers a publicly available dataset and code, enabling faster, more accurate auto-labeling and improved training data quality for crowded urban perception tasks.

Abstract

Perceiving pedestrians in highly crowded urban environments is a difficult long-tail problem for learning-based autonomous perception. Speeding up 3D ground truth generation for such challenging scenes is performance-critical yet very challenging. The difficulties include the sparsity of the captured pedestrian point cloud and a lack of suitable benchmarks for a specific system design study. To tackle the challenges, we first collect a new multi-view LiDAR-camera 3D multiple-object-tracking benchmark of highly crowded pedestrians for in-depth analysis. We then build an offboard auto-labeling system that reconstructs pedestrian trajectories from LiDAR point cloud and multi-view images. To improve the generalization power for crowded scenes and the performance for small objects, we propose to learn high-resolution representations that are density-aware and relationship-aware. Extensive experiments validate that our approach significantly improves the 3D pedestrian tracking performance towards higher auto-labeling efficiency. The code will be publicly available at this HTTP URL.

Paper Structure

This paper contains 15 sections, 7 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: A comparison of the pedestrian density with varying circle radii.
  • Figure 2: An example frame of PCP-MV. Left: bird's-eye view (BEV) of the captured point cloud and box annotations. Right: the six input camera views. Pedestrian annotations are shown in blue; note the high bounding-box density. The data collection vehicle has four cameras, including two surrounding-view (SV) fisheye cameras. The fisheye images are rectified to produce four SV pinhole images that are fed to our system. Best viewed in color; zoom in for more details.
  • Figure 3: Diagram of our offboard system for 3D MOT of crowded pedestrians. Our proposed representation learning approaches are highlighted with pink rectangles. A Swin Transformer [liu2021swin] extracts multi-view (MV) camera features. A uniform voxel grid is built in 3D space, where each voxel gathers features at its projected location on the images. The visual voxel features are pooled in the height dimension to form a BEV feature. The point cloud inputs are voxelized and processed by sparse convolution layers. A CNN fuses LiDAR and camera BEV features, and a head module predicts object attributes and offsets used for tracking. To enhance representation learning for crowded pedestrians, we further propose the density-aware loss, the relationship offset targets, and the high-resolution sparse feature learning module.
  • Figure 4: (a) A ground truth heatmap and (b) the density-aware weights used for computing the focal loss. The weights are larger in spatial regions where more objects appear.
  • Figure 5: Ground truth relationship offset targets shown as arrows. The relationship offset targets differ for adjacent objects, forcing the model to recognize different instances.
  • ...and 3 more figures
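
The captions for Figures 1 and 4 describe pedestrian density measured within a circle of a given radius, and loss weights that grow where more objects appear. The sketch below illustrates that idea only. The function names, the default radius, the `alpha` factor, and the linear rule `1 + alpha * count` are assumptions made for illustration; the paper's actual weighting formula is not given in this summary.

```python
import numpy as np


def neighbor_counts(centers, radius):
    """For each BEV center, count the other centers within `radius` meters."""
    centers = np.asarray(centers, dtype=float).reshape(-1, 2)
    # Pairwise Euclidean distances between all centers, shape (N, N).
    diff = centers[:, None, :] - centers[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Each center is at distance 0 from itself, so subtract one.
    return (dist <= radius).sum(axis=1) - 1


def density_weights(centers, radius=2.0, alpha=0.5):
    """Hypothetical per-object loss weight: 1 + alpha * local neighbor count."""
    return 1.0 + alpha * neighbor_counts(centers, radius)


if __name__ == "__main__":
    # Three pedestrians standing close together and one isolated pedestrian.
    centers = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
    print(neighbor_counts(centers, radius=2.0))
    print(density_weights(centers))
```

With weights like these, each object's term in a detection loss could be scaled so that errors in crowded regions cost more than errors on isolated pedestrians, which matches the behavior the Figure 4 caption describes.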