
Leveraging Large Vision Model for Multi-UAV Co-perception in Low-Altitude Wireless Networks

Yunting Xu, Jiacheng Wang, Ruichen Zhang, Changyuan Zhao, Yinqiu Liu, Dusit Niyato, Liang Yu, Haibo Zhou, Dong In Kim

Abstract

Multi-uncrewed aerial vehicle (UAV) cooperative perception has emerged as a promising paradigm for diverse low-altitude economy applications, where complementary multi-view observations are leveraged to enhance perception performance via wireless communications. However, the massive visual data generated by multiple UAVs poses significant challenges in terms of communication latency and resource efficiency. To address these challenges, this paper proposes a communication-efficient cooperative perception framework, termed Base-Station-Helped UAV (BHU), which reduces communication overhead while enhancing perception performance. Specifically, we employ a Top-K selection mechanism to identify the most informative pixels from UAV-captured RGB images, enabling sparsified visual transmission with reduced data volume and latency. The sparsified images are transmitted to a ground server via multi-user MIMO (MU-MIMO), where a Swin-large-based MaskDINO encoder extracts bird's-eye-view (BEV) features and performs cooperative feature fusion for ground vehicle perception. Furthermore, we develop a diffusion model-based deep reinforcement learning (DRL) algorithm to jointly select cooperative UAVs, sparsification ratios, and precoding matrices, achieving a balance between communication efficiency and perception utility. Simulation results on the Air-Co-Pred dataset demonstrate that, compared with traditional CNN-based BEV fusion baselines, the proposed BHU framework improves perception performance by over 5% while reducing communication overhead by 85%, providing an effective solution for multi-UAV cooperative perception under resource-constrained wireless environments.
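The Top-K sparsification step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how pixel informativeness is scored, so the `contrast_scores` helper below (absolute deviation from the mean gray level) is purely a placeholder assumption, and the function names, the list-of-tuples image layout, and the zero-filling of dropped pixels are likewise hypothetical rather than the authors' implementation.

```python
import heapq


def contrast_scores(image):
    """Hypothetical informativeness score: |gray - mean gray| per pixel.

    This scoring rule is an assumption for illustration only; the paper's
    actual criterion is not given in the abstract.
    """
    gray = [[sum(pixel) / 3.0 for pixel in row] for row in image]
    total = sum(sum(row) for row in gray)
    mean = total / (len(gray) * len(gray[0]))
    return [[abs(value - mean) for value in row] for row in gray]


def top_k_sparsify(image, scores, ratio):
    """Keep the top `ratio` fraction of pixels by score and zero the rest.

    image  : H x W nested list of (r, g, b) tuples
    scores : H x W nested list of floats, higher means more informative
    ratio  : sparsification ratio in (0, 1]
    Returns (sparse_image, mask), where mask marks the retained pixels.
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    height, width = len(image), len(image[0])
    k = max(1, int(ratio * height * width))

    # Flatten the scores so each pixel has a single index r * width + c.
    flat_scores = [scores[r][c] for r in range(height) for c in range(width)]
    keep = set(heapq.nlargest(k, range(height * width),
                              key=lambda i: flat_scores[i]))

    mask = [[(r * width + c) in keep for c in range(width)]
            for r in range(height)]
    sparse = [[image[r][c] if mask[r][c] else (0, 0, 0)
               for c in range(width)]
              for r in range(height)]
    return sparse, mask
```

In this sketch only the retained pixels (plus their positions) would need to be transmitted, so the payload scales roughly with `ratio`; in the framework described above that ratio is one of the quantities chosen by the DRL agent rather than fixed by hand.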

Paper Structure (25 sections, 62 equations, 12 figures, 3 tables, 1 algorithm)

Figures (12)

  • Figure 1: Illustration of multi-UAV cooperative perception scenario. The UAV-captured aerial images are transmitted to a ground server via MU-MIMO communication, where multi-view information is fused to enable cooperative feature learning for vehicle instance and motion detection in low-altitude economy scenarios.
  • Figure 2: Illustration of the proposed BHU framework for multi-UAV cooperative perception. The aerial images are first sparsified via a Top-K selection mechanism and transmitted to the ground server through wireless links. A MaskDINO-based encoder extracts multi-scale features, which are projected into BEV representations and cooperatively fused across multiple UAVs. The fused BEV features are then used for downstream perception tasks, including vehicle instance segmentation and motion flow prediction.
  • Figure 3: Illustration of the proposed DDIM-based DRL framework. An MLP network is employed to determine the cooperative UAVs and Top-K sparsification ratios, while a DDIM module is employed to generate precoding actions by modeling the conditional distribution of optimal precoding vectors through a reverse denoising process.
  • Figure 4: Visualization of 3D object detection results in the BEV representation.
  • Figure 5: The training loss of segmentation and predicted instance flow.
  • ...and 7 more figures