UEMM-Air: Make Unmanned Aerial Vehicles Perform More Multi-modal Tasks

Liang Yao, Fan Liu, Shengxiang Xu, Chuanyi Zhang, Xing Ma, Jianyu Jiang, Zequan Wang, Shimin Di, Jun Zhou

TL;DR

UEMM-Air tackles the data scarcity of multi-modal UAV perception by introducing a large-scale synthetic dataset comprising 120k image pairs across six modalities, augmented with an automatic annotation pipeline and cross-modal text generation. Built with Unreal Engine and AirSim, it supports object detection, instance segmentation, image-text contrastive learning, and referring segmentation, enabling zero-shot and cross-modal capabilities through natural language captions. The authors provide comprehensive benchmarks across detection, segmentation, and cross-modal tasks, demonstrating competitive performance, improved transferability from synthetic to real UAV datasets, and the benefits of multi-modal fusion. This dataset promises to improve pretraining, generalization, and multi-modal reasoning for UAV perception in varied environments.

Abstract

The development of multi-modal learning for Unmanned Aerial Vehicles (UAVs) typically relies on a large amount of pixel-aligned multi-modal image data. However, existing datasets face challenges such as limited modalities, high construction costs, and imprecise annotations. To this end, we propose a synthetic multi-modal UAV-based multi-task dataset, UEMM-Air. Specifically, we simulate various UAV flight scenarios and object types using the Unreal Engine (UE). Then we design the UAV's flight logic to automatically collect data from different scenarios, perspectives, and altitudes. Furthermore, we propose a novel heuristic automatic annotation algorithm to generate accurate object detection labels. Finally, we utilize labels to generate text descriptions of images to make our UEMM-Air support more cross-modality tasks. In total, our UEMM-Air consists of 120k pairs of images with 6 modalities and precise annotations. Moreover, we conduct numerous experiments and establish new benchmark results on our dataset. We also found that models pre-trained on UEMM-Air exhibit better performance on downstream tasks compared to other similar datasets. The dataset is publicly available (https://github.com/1e12Leon/UEMM-Air) to support the research of multi-modal tasks on UAVs.
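
The abstract states that detection labels are reused to generate text descriptions, and the Figure 3 caption adds that categories, quantities, and spatial relationships are extracted for this purpose. The following is a minimal sketch of that idea, not the authors' implementation: the label format `(category, (x_min, y_min, x_max, y_max))`, the function names, the caption template, and the left/center/right split of the image are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical phrasing for the three coarse horizontal regions.
REGION_PHRASE = {"left": "on the left", "center": "in the center", "right": "on the right"}


def horizontal_region(box, image_width):
    """Return 'left', 'center', or 'right' based on the box center (assumed rule)."""
    x_min, _, x_max, _ = box
    center_x = (x_min + x_max) / 2
    if center_x < image_width / 3:
        return "left"
    if center_x < 2 * image_width / 3:
        return "center"
    return "right"


def join_phrases(phrases):
    """Join ['a', 'b', 'c'] as 'a, b and c'."""
    if len(phrases) == 1:
        return phrases[0]
    return ", ".join(phrases[:-1]) + " and " + phrases[-1]


def generate_caption(labels, image_width):
    """Build a caption from labels given as (category, box) pairs."""
    if not labels:
        return "An aerial image with no annotated objects."

    # Count each category separately per region: category, quantity, position.
    per_region = defaultdict(Counter)
    for category, box in labels:
        per_region[horizontal_region(box, image_width)][category] += 1

    clauses = []
    for region in ("left", "center", "right"):
        counts = per_region.get(region)
        if not counts:
            continue
        # Naive pluralization (appends 's'); irregular nouns would need a lookup.
        phrases = [f"{n} {cat}{'s' if n > 1 else ''}" for cat, n in sorted(counts.items())]
        clauses.append(f"{join_phrases(phrases)} {REGION_PHRASE[region]}")
    return "An aerial image with " + "; ".join(clauses) + "."
```

For example, two `car` boxes near the left edge and one `truck` box near the right edge of a 600-pixel-wide image yield "An aerial image with 2 cars on the left; 1 truck on the right." A template like this keeps captions consistent with the labels by construction, at the cost of limited linguistic variety.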

Paper Structure

This paper contains 29 sections, 10 figures, and 8 tables.

Figures (10)

  • Figure 1: (a) Comparison of UEMM-Air and other UAV environmental perception datasets. (b) Various UEMM-Air targeted tasks. Our dataset stands out as the largest in terms of data scale, featuring the most paired modality types and the greatest variety of tasks among existing datasets.
  • Figure 2: UEMM-Air is a multi-scene, multi-modal, and multi-perspective UAV-based perception dataset. (a) Scene (outer) and object category (inner) distribution. (b) UEMM-Air features multiple scenes and various perspectives of the same view. (c) UEMM-Air encompasses 6 modalities: RGB, surface normals, segmentation, depth, IMU parameters, and textual descriptions.
  • Figure 3: Pipeline of our data construction. Step 1: We design the automatic flight logic of the UAV to collect images from different altitudes, perspectives, and modalities. Step 2: We perform contour detection on the segmentation image to obtain object bounding boxes. To handle visually overlapping objects, we introduce depth information, since a significant change in depth typically indicates multiple objects. Step 3: We extract the objects' categories, quantities, and spatial relationships to generate captions for image-text contrastive learning and referring image segmentation.
  • Figure 4: We randomly sampled images from various scenes and visualized the features extracted by the CLIP image encoder using t-SNE. The significant differences in features across scenes indicate that our dataset is beneficial for enhancing the model's domain generalization performance.
  • Figure 5: Comparison of SynDrone and our UEMM-Air. Red and yellow bounding boxes indicate incorrect and correct labels, respectively. We provide two viewpoints from one scene in UEMM-Air, where blue boxes indicate originally blocked objects in the other viewpoint. SynDrone has incorrect labels where objects are visibly blocked, while UEMM-Air consistently demonstrates superior labeling accuracy, especially in challenging scenarios where objects are partially obscured.
  • ...and 5 more figures
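
Step 2 of the Figure 3 caption (boxes from the segmentation image, with depth used to separate visually overlapping objects) can be illustrated with a small sketch. This is not the authors' algorithm: it swaps contour detection for a plain flood fill over a binary class mask, and it treats two neighboring pixels as belonging to different objects when their depth differs by more than a threshold. The function name `boxes_from_mask`, the nested-list input format, and the `max_depth_jump` parameter are assumptions made for illustration.

```python
from collections import deque


def boxes_from_mask(mask, depth, max_depth_jump=1.0):
    """Return (x_min, y_min, x_max, y_max) boxes, inclusive pixel coordinates.

    mask:  2D list, truthy where a pixel belongs to the class of interest.
    depth: 2D list of floats with the same shape as mask.
    Neighboring mask pixels are merged into one object only if their depth
    difference is at most max_depth_jump (assumed heuristic).
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []

    for y0 in range(height):
        for x0 in range(width):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first flood fill from an unvisited mask pixel.
            seen[y0][x0] = True
            queue = deque([(x0, y0)])
            x_min = x_max = x0
            y_min = y_max = y0
            while queue:
                x, y = queue.popleft()
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if not (0 <= nx < width and 0 <= ny < height):
                        continue
                    if seen[ny][nx] or not mask[ny][nx]:
                        continue
                    # A large depth change marks a boundary between objects.
                    if abs(depth[ny][nx] - depth[y][x]) > max_depth_jump:
                        continue
                    seen[ny][nx] = True
                    queue.append((nx, ny))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes
```

On a mask where two same-class objects touch but sit at depths 5 and 8, a threshold of 1.0 produces two boxes, while an infinite threshold (depth ignored) merges them into one. A fixed threshold is the simplest choice; sloped surfaces or very distant objects would likely need a relative or adaptive criterion.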