RayFusion: Ray Fusion Enhanced Collaborative Visual Perception

Shaohong Wang, Bin Lu, Xinyu Xiao, Hanzhi Zhong, Bowen Pang, Tong Wang, Zhiyu Xiang, Hangguan Shan, Eryun Liu

TL;DR

RayFusion is a ray-based fusion method for collaborative visual perception. It reduces redundant and false positive predictions along camera rays, improving the detection performance of purely camera-based collaborative perception systems.

Abstract

Collaborative visual perception methods have gained widespread attention in the autonomous driving community in recent years due to their ability to address sensor limitation problems. However, the absence of explicit depth information often makes it difficult for camera-based perception systems, e.g., 3D object detection, to generate accurate predictions. To alleviate the ambiguity in depth estimation, we propose RayFusion, a ray-based fusion method for collaborative visual perception. Using ray occupancy information from collaborators, RayFusion reduces redundancy and false positive predictions along camera rays, enhancing the detection performance of purely camera-based collaborative perception systems. Comprehensive experiments show that our method consistently outperforms existing state-of-the-art models, substantially advancing the performance of collaborative visual perception. The code is available at https://github.com/wangsh0111/RayFusion.

Paper Structure

This paper contains 23 sections, 15 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Due to the ambiguity in depth estimation, an individual agent typically predicts multiple targets along the camera ray. However, when an instance is observed by multiple agents, they can record the target's occupancy information along the ray and cross-validate its true 3D position.
  • Figure 2: Overall architecture of RayFusion. i) The single-agent detector and the collaborative message generation module generate instance information for communication and collaboration; ii) The spatial-temporal alignment module enhances system robustness to latency by modeling motion; iii) The ray occupancy information encoding module leverages multi-view information to mitigate depth estimation ambiguity; iv) The multi-scale instance feature aggregation facilitates effective interaction among instance features, promoting comprehensive and precise collaborative perception.
  • Figure 2: Ablation study results on the OPV2V and DAIR-V2X datasets. STA, MIFA, ROE represent: i) spatial-temporal alignment, ii) multi-scale instance feature aggregation, and iii) ray occupancy information encoding, respectively. IFA replaces the pyramid window self-attention in MIFA with vanilla multi-head self-attention.
  • Figure 3: Robustness to localization errors and communication delays on the V2XSet dataset.
  • Figure 3: Analysis of different components in ROE on the OPV2V and DAIR-V2X datasets. RE, OE represent: i) ray encoding and ii) occupancy information encoding, respectively. WH represents the removal of high-dimensional mapping in ray encoding.
  • ...and 3 more figures
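
The cross-validation idea in the Figure 1 caption can be illustrated with a toy geometric sketch. This is not RayFusion's ray occupancy information encoding, which is a learned module; it is a simplified 2D bird's-eye-view example under stated assumptions. The helper names (`point_on_ray`, `distance_to_ray`, `cross_validate`), the agent positions, the candidate depths, and the 1 m tolerance are all hypothetical and chosen only for illustration. The ego agent proposes several candidates along one camera ray, and a collaborator's ray toward the same object is used to keep only the candidate consistent with both views.

```python
import math


def normalize(v):
    """Return the unit vector pointing in the direction of v."""
    n = math.hypot(v[0], v[1])
    return (v[0] / n, v[1] / n)


def point_on_ray(origin, direction, depth):
    """Point at the given depth along a ray with a unit direction."""
    return (origin[0] + depth * direction[0], origin[1] + depth * direction[1])


def distance_to_ray(point, origin, direction):
    """Shortest distance from a point to a half-line (unit direction)."""
    px, py = point[0] - origin[0], point[1] - origin[1]
    # Project onto the ray and clamp so points behind the camera are not matched.
    t = max(0.0, px * direction[0] + py * direction[1])
    cx, cy = origin[0] + t * direction[0], origin[1] + t * direction[1]
    return math.hypot(point[0] - cx, point[1] - cy)


def cross_validate(candidates, collab_origin, collab_dir, tol=1.0):
    """Keep candidates that lie within tol metres of the collaborator's ray."""
    return [
        p for p in candidates
        if distance_to_ray(p, collab_origin, collab_dir) <= tol
    ]


# Hypothetical scene: the ego camera looks along +x and cannot resolve depth,
# so it proposes three candidates on the same ray.
ego_origin = (0.0, 0.0)
ego_dir = normalize((1.0, 0.0))
candidates = [point_on_ray(ego_origin, ego_dir, d) for d in (10.0, 20.0, 30.0)]

# A collaborator at a different position observes the same object along +y.
collab_origin = (20.0, -15.0)
collab_dir = normalize((0.0, 1.0))

kept = cross_validate(candidates, collab_origin, collab_dir)
print(kept)
```

In this toy scene only the candidate at depth 20 m lies on the collaborator's ray, so the other two are discarded as depth-ambiguity false positives. A hard distance threshold is used here purely for readability and stands in for whatever the actual method learns from ray occupancy information.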