AirCopBench: A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning

Jirong Zha, Yuxuan Fan, Tianyu Zhang, Geng Chen, Yingfeng Chen, Chen Gao, Xinlei Chen

TL;DR

AirCopBench addresses the absence of benchmarks for multi-UAV embodied perception by providing a large-scale, degraded-perception benchmark with 2.9k+ multi-view images and 14.6k+ VQA pairs across four evaluation dimensions. It combines simulator and real-world data, augmented with noise and masking, and employs a four-stage generation pipeline (Data Collection, Annotation, Question Generation, Quality Control) to produce high-quality, event-labelled, semantic VQA data. Evaluations on 40 MLLMs reveal significant gaps in multi-view collaborative reasoning, with sim-to-real transfer improved through supervised fine-tuning on simulated data. The benchmark’s findings highlight task-specific biases and underscore the need for advances in embodied collaboration and robust multi-view reasoning for practical aerial systems.

Abstract

Multimodal Large Language Models (MLLMs) have shown promise in single-agent vision tasks, yet benchmarks for evaluating multi-agent collaborative perception remain scarce. This gap is critical, as multi-drone systems provide enhanced coverage, robustness, and collaboration compared to single-sensor setups. Existing multi-image benchmarks mainly target basic perception tasks using high-quality single-agent images, thus failing to evaluate MLLMs in more complex, egocentric collaborative scenarios, especially under real-world degraded perception conditions. To address these challenges, we introduce AirCopBench, the first comprehensive benchmark designed to evaluate MLLMs in embodied aerial collaborative perception under challenging perceptual conditions. AirCopBench includes 14.6k+ questions derived from both simulator and real-world data, spanning four key task dimensions: Scene Understanding, Object Understanding, Perception Assessment, and Collaborative Decision, across 14 task types. We construct the benchmark using data from challenging degraded-perception scenarios with annotated collaborative events, generating large-scale questions through model-, rule-, and human-based methods under rigorous quality control. Evaluations on 40 MLLMs show significant performance gaps in collaborative perception tasks, with the best model trailing humans by 24.38% on average and exhibiting inconsistent results across tasks. Fine-tuning experiments further confirm the feasibility of sim-to-real transfer in aerial collaborative perception and reasoning.

Paper Structure

This paper contains 24 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: a) Illustration of multi-drone collaborative perception under various types of perception degradation. b) The performance of 6 popular MLLMs, along with human and random guess baselines, on AirCopBench.
  • Figure 2: AirCopBench includes 14 task types across 4 evaluation dimensions: Scene Understanding, Object Understanding, Perception Assessment, and Collaborative Decision. This categorization facilitates a systematic evaluation of MLLMs, from image understanding and quality analysis to multi-UAV information exchange for improved collaborative embodied perception.
  • Figure 3: AirCopBench generation pipeline includes 4 main steps: Data Collection, Data Annotation, Question Generation, and Quality Control. This systematic approach ensures the validity and high quality of the generated dataset.
  • Figure 4: Statistical Overview of AirCopBench. a) Distribution of VQA pairs across 14 task types. b) Distribution of VQA pairs from various data sources with different numbers of observing UAV groups. c) Distribution of images featuring diverse perception degradation types.
  • Figure 5: Correlation coefficients of MLLMs' performance across all tasks, with higher values indicating greater similarity in the cognitive abilities required by the two tasks.
  • ...and 1 more figure