UEMM-Air: Make Unmanned Aerial Vehicles Perform More Multi-modal Tasks
Liang Yao, Fan Liu, Shengxiang Xu, Chuanyi Zhang, Xing Ma, Jianyu Jiang, Zequan Wang, Shimin Di, Jun Zhou
TL;DR
UEMM-Air tackles the data scarcity of multi-modal UAV perception by introducing a large-scale synthetic dataset of 120k image pairs across six modalities, together with an automatic annotation pipeline and label-driven text generation. Built with Unreal Engine and AirSim, it supports object detection, instance segmentation, image-text contrastive learning, and referring segmentation, and its natural language captions enable zero-shot and cross-modal evaluation. The authors provide benchmarks across detection, segmentation, and cross-modal tasks, reporting competitive performance, improved transfer from synthetic to real UAV datasets, and gains from multi-modal fusion. The dataset is intended to improve pretraining, generalization, and multi-modal reasoning for UAV perception in varied environments.
Abstract
The development of multi-modal learning for Unmanned Aerial Vehicles (UAVs) typically relies on a large amount of pixel-aligned multi-modal image data. However, existing datasets face challenges such as limited modalities, high construction costs, and imprecise annotations. To this end, we propose UEMM-Air, a synthetic multi-modal, multi-task dataset for UAVs. Specifically, we simulate various UAV flight scenarios and object types using the Unreal Engine (UE). We then design the UAV's flight logic to automatically collect data from different scenarios, perspectives, and altitudes. Furthermore, we propose a novel heuristic automatic annotation algorithm to generate accurate object detection labels. Finally, we use these labels to generate text descriptions of the images so that UEMM-Air supports more cross-modal tasks. In total, UEMM-Air consists of 120k pairs of images with 6 modalities and precise annotations. Moreover, we conduct numerous experiments and establish new benchmark results on our dataset. We also find that models pre-trained on UEMM-Air perform better on downstream tasks than models pre-trained on other similar datasets. The dataset is publicly available (https://github.com/1e12Leon/UEMM-Air) to support research on multi-modal tasks for UAVs.
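As a rough illustration of the label-to-text step mentioned in the abstract, the sketch below turns a list of per-image class names into a one-sentence caption. It is not the authors' pipeline: the function name `caption_from_labels`, the `scene` parameter, and the caption template are all hypothetical and chosen only to show the general idea.

```python
from collections import Counter


def caption_from_labels(class_names, scene=None):
    """Build a simple caption from the class names annotated in one image.

    Hypothetical helper for illustration only; the template and the
    optional `scene` tag are assumptions, not part of UEMM-Air.
    """
    counts = Counter(class_names)
    prefix = "An aerial image" + (f" of a {scene} scene" if scene else "")
    if not counts:
        return prefix + " with no annotated objects."
    # Sort by class name so the caption is deterministic.
    listing = ", ".join(f"{name} ({n})" for name, n in sorted(counts.items()))
    return f"{prefix} containing {listing}."


if __name__ == "__main__":
    print(caption_from_labels(["car", "truck", "car"], scene="city"))
```

A template of this kind gives each image a caption that is consistent with its detection labels, which is the property that image-text contrastive learning and referring tasks depend on.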
