OpenCOOD-Air: Prompting Heterogeneous Ground-Air Collaborative Perception with Spatial Conversion and Offset Prediction

Xianke Wu; Songlin Bai; Chengxiang Li; Zhiyao Luo; Yulin Tian; Fenghua Zhu; Yisheng Lv; Yonglin Tian

Abstract

While Vehicle-to-Vehicle (V2V) collaboration extends sensing ranges through multi-agent data sharing, its reliability remains severely constrained by ground-level occlusions and the limited perspective of chassis-mounted sensors, which often result in critical perception blind spots. We propose OpenCOOD-Air, a novel framework that integrates UAVs as extensible platforms into V2V collaborative perception to overcome these constraints. To mitigate gradient interference from ground-air domain gaps and data sparsity, we adopt a transfer learning strategy to fine-tune UAV weights from pre-trained V2V models. To prevent the spatial information loss inherent in this transition, we formulate ground-air collaborative perception as a heterogeneous integration task with explicit altitude supervision and introduce a Cross-Domain Spatial Converter (CDSC) and a Spatial Offset Prediction Transformer (SOPT). Furthermore, we present the OPV2V-Air benchmark to validate the transition from V2V to Vehicle-to-Vehicle-to-UAV. Compared to state-of-the-art methods, our approach improves 2D and 3D AP@0.7 by 4% and 7%, respectively.

Paper Structure (28 sections, 9 equations, 7 figures, 5 tables)

Introduction
Related Work
Method
Overview and Motivation
Cross-Domain Spatial Converter (CDSC)
Resolution Rescaling
Cross-Domain Feature Mapping
SOPT: Spatial Offset Prediction Transformer
Feature Projection and Tokenization
Global Context Encoding
Offset Regression
Joint Optimization Objectives
Feature Offset Supervision
Height Calibration and Vertical Consistency Loss
Experiment
...and 13 more sections

Figures (7)

Figure 1: Overview of the OpenCOOD-Air framework and its motivation. (1) Naive Collaborative Training suffers from convergence difficulty due to the significant domain gap between ground vehicles and UAVs. (2) Transfer Learning from pre-trained V2V models stabilizes training but introduces spatial misalignment, as it fails to capture the unique altitude dimension of UAVs. (3) Our Method addresses these issues by inheriting V2V knowledge and employing a CDSC module for feature alignment, alongside a SOPT module to explicitly supervise and rectify altitude-induced geometric discrepancies.
Figure 2: Overview of the proposed OpenCOOD-Air framework. The architecture is implemented in three steps: (1) training a base model for vehicle-to-vehicle collaboration; (2) using the CDSC and SOPT modules to align the aerial and ground perspectives; (3) merging the trained weights and integrating V2V and V2U features for the final ground-air collaborative perception.
Figure 3: Detailed architectures of the proposed modules. (a) CDSC performs spatial alignment via bilinear interpolation and ConvNeXt-based geometric rectification; (b) SOPT utilizes Transformer encoders to regress the 3D spatial offset $\Delta \mathbf{p}$ for explicit spatial supervision.
Figure 4: Camera data examples from the OPV2V-Air dataset. The white circles highlight the positions of Vehicle 1 and Vehicle 2 in the UAV data.
Figure 5: Robustness experiment under pose error. Pose noise is sampled from $\mathcal{N}(0, \sigma_p^2)$ on the x, y location and from $\mathcal{N}(0, \sigma_r^2)$ on the yaw angle.
...and 2 more figures
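
The pose-noise setup described in the Figure 5 caption can be illustrated with a minimal sketch. This is not the authors' code: the `Pose` class, the `add_pose_noise` function, and the units (metres for location, degrees for yaw) are assumptions made for illustration only. The only part taken from the caption is the noise model, i.e. zero-mean Gaussian noise with standard deviation $\sigma_p$ on x and y and $\sigma_r$ on yaw.

```python
import random
from dataclasses import dataclass


@dataclass
class Pose:
    """Hypothetical planar agent pose; units are an assumption."""
    x: float    # assumed metres
    y: float    # assumed metres
    yaw: float  # assumed degrees


def add_pose_noise(pose, sigma_p, sigma_r, rng):
    """Return a copy of `pose` perturbed by zero-mean Gaussian noise.

    x and y each receive an independent draw from N(0, sigma_p^2);
    yaw receives a draw from N(0, sigma_r^2).
    """
    return Pose(
        x=pose.x + rng.gauss(0.0, sigma_p),
        y=pose.y + rng.gauss(0.0, sigma_p),
        yaw=pose.yaw + rng.gauss(0.0, sigma_r),
    )


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a robustness sweep is repeatable
    clean = Pose(x=10.0, y=-5.0, yaw=30.0)
    # Sweep a few illustrative noise levels (values are not from the paper).
    for sigma_p, sigma_r in [(0.0, 0.0), (0.2, 0.2), (0.4, 0.4)]:
        print(sigma_p, sigma_r, add_pose_noise(clean, sigma_p, sigma_r, rng))
```

In a robustness study of this kind, the perturbed pose would typically replace the ground-truth pose of each collaborating agent before feature fusion, so that detection accuracy can be reported as a function of $\sigma_p$ and $\sigma_r$.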
