UVCPNet: A UAV-Vehicle Collaborative Perception Network for 3D Object Detection

Yuchao Wang, Peirui Cheng, Pengju Tian, Ziyang Yuan, Liangjin Zhao, Jing Tian, Wensheng Wang, Zhirui Wang, Xian Sun

TL;DR

This work tackles the problem of 3D object detection in aerial-ground collaborative perception, where disparities in views and depth accuracy hinder effective fusion. It introduces UVCPNet, a BEV-based framework that uses a Cross-Domain Cross-Adaptation (CDCA) module to align multi-domain BEV features and a Collaborative Depth Optimization (CDO) module that refines depth via CRF-guided contextual information without extra supervision. A new synthetic V2U-COO dataset is developed to study air-to-ground cooperation, and extensive experiments on V2U-COO and DAIR-V2X show that UVCPNet delivers substantial gains in mAP (approximately 6.1% on V2U-COO and 2.7% on DAIR-V2X) compared with single-agent baselines and other BEV-based methods. Overall, the approach demonstrates that cross-domain feature alignment and depth-aware BEV fusion can significantly enhance 3D perception in heterogeneous multi-agent systems, with practical implications for robust autonomous sensing in mixed aerial-ground environments.

Abstract

With the advancement of collaborative perception, the role of aerial-ground collaborative perception, a crucial component, is becoming increasingly important. The demand for collaborative perception across different perspectives to construct more comprehensive perceptual information is growing. However, challenges arise due to the disparities in the field of view between cross-domain agents and their varying sensitivity to information in images. Additionally, when we transform image features into Bird's Eye View (BEV) features for collaboration, we need accurate depth information. To address these issues, we propose a framework specifically designed for aerial-ground collaboration. First, to mitigate the lack of datasets for aerial-ground collaboration, we develop a virtual dataset named V2U-COO for our research. Second, we design a Cross-Domain Cross-Adaptation (CDCA) module to align the target information obtained from different domains, thereby achieving more accurate perception results. Finally, we introduce a Collaborative Depth Optimization (CDO) module to obtain more precise depth estimation results, leading to more accurate perception outcomes. We conduct extensive experiments on both our virtual dataset and a public dataset to validate the effectiveness of our framework. Our experiments on the V2U-COO dataset and the DAIR-V2X dataset demonstrate that our method improves detection accuracy by 6.1% and 2.7%, respectively.
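The abstract's point that the image-to-BEV transform depends on accurate depth can be illustrated with a minimal pinhole-camera sketch. This is not the paper's CDO module or its view transform; it only shows how a depth error moves a lifted pixel across BEV cells. The intrinsics (`fx`, `fy`, `cx`, `cy`), the grid parameters, and both function names are hypothetical, and the camera is assumed to be a vehicle camera whose optical axis is parallel to the ground.

```python
import math


def backproject_to_bev(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with a given depth into camera coordinates and
    return its ground-plane position (x_right, z_forward) in metres.

    Assumes a pinhole camera whose optical axis is parallel to the ground,
    so the BEV plane is the camera's x-z plane (an illustrative assumption).
    """
    x = (u - cx) * depth / fx  # lateral offset scales linearly with depth
    z = depth                  # forward distance along the optical axis
    # The vertical coordinate (v - cy) * depth / fy is dropped in BEV.
    return x, z


def bev_cell(x, z, cell_size=0.5, x_min=-50.0, z_min=0.0):
    """Map a ground-plane point to (row, col) in a hypothetical BEV grid."""
    col = math.floor((x - x_min) / cell_size)
    row = math.floor((z - z_min) / cell_size)
    return row, col


# Hypothetical intrinsics and pixel location, chosen only for illustration.
fx = fy = 1000.0
cx, cy = 960.0, 540.0
u, v = 1260.0, 600.0

true_xz = backproject_to_bev(u, v, 40.0, fx, fy, cx, cy)   # correct depth
noisy_xz = backproject_to_bev(u, v, 44.0, fx, fy, cx, cy)  # 10% depth error

# Distance in metres between the correct and the mislocated BEV position.
shift = math.hypot(noisy_xz[0] - true_xz[0], noisy_xz[1] - true_xz[1])

print(bev_cell(*true_xz), bev_cell(*noisy_xz), round(shift, 2))
```

Under these assumed values, a 10% depth error displaces the feature by several metres and many grid cells, which is the kind of misplacement that makes fusing BEV features from different agents harder.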

Paper Structure (23 sections, 14 equations, 10 figures, 8 tables)

Figures (10)

  • Figure 1: The left images show the scenes observed from the UAV's perspective, while the right images show the corresponding views from the vehicles annotated in the left images. There is a significant disparity in the observed field of view between the aerial and ground domains.
  • Figure 2: Overview of the proposed framework. The collaborative inference process can be divided into four parts: 1) feature extraction: each agent extracts features from its input image; 2) depth optimization: accurate depth values are obtained through collaborative optimization; 3) BEV feature fusion: an aligned and enhanced BEV feature map is obtained; and 4) 3D detection: 3D objects are detected.
  • Figure 3: Schematic diagram of the CDCA module. It aligns the obtained BEV feature maps and fuses the BEV information at the same time.
  • Figure 4: The specific category information contained in the V2U-COO dataset.
  • Figure 5: Examples of multiple scenarios in the V2U-COO dataset.
  • ...and 5 more figures