AirCopBench: A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning
Jirong Zha, Yuxuan Fan, Tianyu Zhang, Geng Chen, Yingfeng Chen, Chen Gao, Xinlei Chen
TL;DR
AirCopBench addresses the absence of benchmarks for multi-UAV embodied perception by providing a large-scale, degraded-perception benchmark with 2.9k+ multi-view images and 14.6k+ VQA pairs across four evaluation dimensions. It combines simulator and real-world data, augmented with noise and masking, and employs a four-stage generation pipeline (Data Collection, Annotation, Question Generation, Quality Control) to produce high-quality, event-labelled, semantic VQA data. Evaluations on 40 MLLMs reveal significant gaps in multi-view collaborative reasoning, with sim-to-real transfer improved through supervised fine-tuning on simulated data. The benchmark’s findings highlight task-specific biases and underscore the need for advances in embodied collaboration and robust multi-view reasoning for practical aerial systems.
Abstract
Multimodal Large Language Models (MLLMs) have shown promise in single-agent vision tasks, yet benchmarks for evaluating multi-agent collaborative perception remain scarce. This gap is critical, as multi-drone systems provide enhanced coverage, robustness, and collaboration compared to single-sensor setups. Existing multi-image benchmarks mainly target basic perception tasks using high-quality single-agent images, and therefore fail to evaluate MLLMs in more complex, egocentric collaborative scenarios, especially under real-world degraded perception conditions. To address these challenges, we introduce AirCopBench, the first comprehensive benchmark designed to evaluate MLLMs in embodied aerial collaborative perception under challenging perceptual conditions. AirCopBench includes 14.6k+ questions derived from both simulator and real-world data, spanning four key task dimensions (Scene Understanding, Object Understanding, Perception Assessment, and Collaborative Decision) across 14 task types. We construct the benchmark using data from challenging degraded-perception scenarios with annotated collaborative events, generating large-scale questions through model-, rule-, and human-based methods under rigorous quality control. Evaluations on 40 MLLMs show significant performance gaps in collaborative perception tasks, with the best model trailing humans by 24.38% on average and exhibiting inconsistent results across tasks. Fine-tuning experiments further confirm the feasibility of sim-to-real transfer in aerial collaborative perception and reasoning.
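To make the reported human-model gap concrete, the following is a minimal sketch of how per-dimension accuracy and an average gap could be computed from multiple-choice VQA results. The record layout `(dimension, gold_answer, predicted_answer)`, the function names, and the choice to average the gap over task dimensions are all assumptions made for illustration; the abstract does not specify the benchmark's data format or its averaging scheme, and the sample records are toy values, not benchmark results.

```python
from collections import defaultdict


def accuracy_by_dimension(records):
    """Return accuracy (in percent) per task dimension.

    `records` is an iterable of (dimension, gold_answer, predicted_answer)
    tuples. This layout is hypothetical, not the benchmark's actual format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for dimension, gold, predicted in records:
        total[dimension] += 1
        # Compare option letters case-insensitively, ignoring stray spaces.
        if predicted.strip().upper() == gold.strip().upper():
            correct[dimension] += 1
    return {d: 100.0 * correct[d] / total[d] for d in total}


def average_gap(human_acc, model_acc):
    """Mean of (human - model) accuracy over the dimensions both share.

    Averaging over dimensions is an assumption; a per-task or per-question
    average would be equally plausible.
    """
    shared = sorted(set(human_acc) & set(model_acc))
    if not shared:
        raise ValueError("no shared dimensions to compare")
    return sum(human_acc[d] - model_acc[d] for d in shared) / len(shared)


# Toy records for two of the four dimensions (illustrative values only).
toy_model = [
    ("Scene Understanding", "A", "A"),
    ("Scene Understanding", "B", "C"),
    ("Collaborative Decision", "D", "d"),
    ("Collaborative Decision", "A", "B"),
]
toy_human = [(dim, gold, gold) for dim, gold, _ in toy_model]

model_acc = accuracy_by_dimension(toy_model)
human_acc = accuracy_by_dimension(toy_human)
gap = average_gap(human_acc, model_acc)
```

Reporting accuracy per dimension rather than as a single pooled number keeps the inconsistency across tasks visible, which a pooled score would hide.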
