LiDAR-based End-to-end Temporal Perception for Vehicle-Infrastructure Cooperation

Zhenwei Yang, Jilei Mao, Wenxian Yang, Yibo Ai, Yu Kong, Haibao Yu, Weidong Zhang

Abstract

Temporal perception, defined as the capability to detect and track objects across temporal sequences, is a fundamental component of autonomous driving systems. Single-vehicle perception is limited by object occlusion and inherent blind spots, which leave its view of the scene incomplete, while cooperative perception introduces its own challenges in sensor calibration precision and positioning accuracy. To address these issues, we introduce LET-VIC, a LiDAR-based End-to-End Tracking framework for Vehicle-Infrastructure Cooperation (VIC). First, we employ Temporal Self-Attention and VIC Cross-Attention modules to integrate temporal and spatial information from both the vehicle and infrastructure perspectives. Then, we develop a novel Calibration Error Compensation (CEC) module to mitigate sensor misalignment and enable accurate feature alignment. Experiments on the V2X-Seq-SPD dataset demonstrate that LET-VIC significantly outperforms baseline models. Compared to LET-V, LET-VIC achieves a +15.0% improvement in mAP and a +17.3% improvement in AMOTA. Furthermore, LET-VIC surpasses representative Tracking-by-Detection models, including V2VNet, FFNet, and PointPillars, by at least +13.7% in mAP and +13.1% in AMOTA when communication delays are not considered, showcasing its robust detection and tracking performance. The experiments demonstrate that integrating multi-view perspectives, temporal sequences, or CEC in end-to-end training significantly improves both detection and tracking performance. All code will be open-sourced.

Paper Structure

This paper contains 26 sections, 7 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Vehicle-Infrastructure Cooperative Diagram. The red car represents the Ego Vehicle. The vehicle-side LiDAR covers a semicircular area in front of the Ego Vehicle, while the infrastructure-side LiDAR has a fan-shaped coverage area. Through vehicle-infrastructure cooperation, the infrastructure can provide additional perception information to the vehicle.
  • Figure 2: Vehicle-Infrastructure Cooperative Perception. (a) illustrates perception using only infrastructure-side LiDAR, (b) illustrates perception using only vehicle-side LiDAR, and (c) shows the cooperative perception results combining (a) and (b). The red car represents the Ego Vehicle, green boxes indicate detected objects with their orientations, targets within the red dashed circles are those occluded from the vehicle's view, and targets within the black dashed circles are outside the vehicle's perception range.
  • Figure 3: Architecture of LET-VIC. The diagram illustrates the steps involved in the LET-VIC framework. (a) and (b) The infrastructure side and the vehicle side both employ PointPillars (Lang et al., 2019) to extract LiDAR point cloud features. The infrastructure-side features are then transmitted to the vehicle side through V2X communication and fed into the VIC Cross-Attention module along with the vehicle-side features. (c) The BEV encoder of LET-VIC is inspired by BEVFormer. (d) The decoder layers of LET-VIC employ TrackFormer. (e) The VIC Cross-Attention module fuses the Bird's Eye View (BEV) features from both the infrastructure and vehicle sides, compensating for calibration errors.
  • Figure 4: PointCloud Backbone. The PointCloud Backbone extracts Bird's Eye View (BEV) features from point cloud data. The pipeline starts with data preprocessing, voxelization, and feature extraction via PillarFeatureNet. Pseudo-image features are then processed by SECOND and FPN, producing multi-level features for object perception at various scales.
  • Figure 5: VIC Cross-Attention. The VIC Cross-Attention module fuses the Bird's Eye View (BEV) features from both the infrastructure and vehicle sides, compensating for calibration errors. First, we project BEV queries onto both the infrastructure-side and vehicle-side features to obtain the original reference point. Then, we use a learnable network to predict the calibration compensation that refines the original reference point. Finally, we apply deformable attention sampling around the corrected reference point to generate the respective BEV feature.
  • ...and 2 more figures
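
The three steps listed for VIC Cross-Attention in the Figure 5 caption (obtain a reference point, predict a calibration compensation, sample around the corrected point) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear offset predictor, the parameter names (`w_off`, `b_off`, `w_samp`, `w_attn`), the single-head, single-scale setting, and the plain bilinear sampler are all hypothetical choices made for the example.

```python
import numpy as np


def bilinear_sample(feat, xy):
    """Sample feat (H, W, C) at continuous pixel coords xy (N, 2), given as (x, y).

    Points outside the map are clamped to the border.
    """
    h, w, _ = feat.shape
    x = np.clip(xy[:, 0], 0, w - 1)
    y = np.clip(xy[:, 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bottom = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bottom * wy


def compensated_sample(queries, ref_xy, feat, w_off, b_off, w_samp, w_attn):
    """Deformable-style sampling around a calibration-corrected reference point.

    queries: (N, C) BEV queries
    ref_xy:  (N, 2) original reference points in feat's pixel coordinates
    feat:    (H, W, D) BEV feature map from one side (vehicle or infrastructure)
    w_off, b_off: hypothetical linear layer predicting the calibration offset
    w_samp:  (C, K * 2) weights predicting K sampling offsets per query
    w_attn:  (C, K) weights predicting K attention logits per query
    """
    # Step 2: refine the original reference point with a predicted compensation.
    corrected = ref_xy + queries @ w_off + b_off

    # Step 3: sample K points around the corrected reference point.
    k = w_attn.shape[1]
    offsets = (queries @ w_samp).reshape(-1, k, 2)
    logits = queries @ w_attn
    logits = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)

    fused = np.zeros((queries.shape[0], feat.shape[2]))
    for i in range(k):
        sampled = bilinear_sample(feat, corrected + offsets[:, i])
        fused += weights[:, i:i + 1] * sampled
    return corrected, fused


# Toy feature map whose two channels store the (x, y) pixel coordinates,
# so the sampled feature directly reveals where sampling took place.
feat = np.stack(np.meshgrid(np.arange(8.0), np.arange(8.0)), axis=-1)
queries = np.ones((1, 4))
ref_xy = np.array([[2.0, 3.0]])  # step 1: original reference point (assumed given)
corrected, fused = compensated_sample(
    queries, ref_xy, feat,
    w_off=np.zeros((4, 2)), b_off=np.array([1.5, -0.5]),  # fixed toy offset
    w_samp=np.zeros((4, 6)), w_attn=np.zeros((4, 3)),
)
print(corrected, fused)
```

In this toy setup the offset predictor is fixed to shift every reference point by (1.5, -0.5), so the reference point (2, 3) moves to (3.5, 2.5), and the sampled feature equals that coordinate. In a trained model the offset and sampling weights would be learned end to end rather than set by hand.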