UAV-MM3D: A Large-Scale Synthetic Benchmark for 3D Perception of Unmanned Aerial Vehicles with Multi-Modal Data

Longkun Zou, Jiale Wang, Rongqin Liang, Hai Wu, Ke Chen, Yaowei Wang

TL;DR

UAV-MM3D tackles the need for scalable, richly annotated, multimodal data for low-altitude UAV perception. It provides a synthetic dataset with five modalities and dense 2D/3D annotations across diverse scenes and weather, enabling 2D/3D detection, 6-DoF pose estimation, tracking, and trajectory forecasting. The authors introduce LGFusionNet, a LiDAR-guided fusion baseline that uses cross-branch spatial alignment to fuse RGB and IR with LiDAR and Radar, plus a trajectory baseline; they demonstrate that geometry-guided fusion yields substantial performance gains over unimodal and non-geometric fusion approaches. The dataset's controllable simulation environment and benchmarks offer a standardized platform for evaluating cross-modal perception, with practical implications for airspace security and autonomous UAV operations.

Abstract

Accurate perception of UAVs in complex low-altitude environments is critical for airspace security and related intelligent systems. Developing reliable solutions requires large-scale, accurately annotated, multimodal data. However, real-world UAV data collection is constrained by airspace regulations, privacy concerns, and environmental variability, while manual annotation of 3D poses and cross-modal correspondences is time-consuming and costly. To overcome these challenges, we introduce UAV-MM3D, a high-fidelity multimodal synthetic dataset for low-altitude UAV perception and motion understanding. It comprises 400K synchronized frames across diverse scenes (urban areas, suburbs, forests, coastal regions) and weather conditions (clear, cloudy, rainy, foggy), featuring multiple UAV models (micro, small, medium-sized) and five modalities: RGB, IR, LiDAR, Radar, and DVS (Dynamic Vision Sensor). Each frame provides 2D/3D bounding boxes, 6-DoF poses, and instance-level annotations, enabling core UAV tasks such as 3D detection, pose estimation, target tracking, and short-term trajectory forecasting. We further propose LGFusionNet, a LiDAR-guided multimodal fusion baseline, and a dedicated UAV trajectory prediction baseline to facilitate benchmarking. With its controllable simulation environment, comprehensive scenario coverage, and rich annotations, UAV-MM3D offers a public benchmark for advancing 3D perception of UAVs.

Paper Structure

This paper contains 20 sections, 2 equations, 7 figures, 8 tables, 1 algorithm.

Figures (7)

  • Figure 1: UAV-MM3D data collection framework. The UE4–CARLA simulation server provides multi-weather UAV scenes, while the Python client consists of a Weather Controller, Frame Controller, and Coordinate Processor. The framework outputs a fully aligned multi-modal dataset across diverse conditions.
  • Figure 2: Visualization of the seven UAV platforms simulated in UAV-MM3D, including DJI Avata 2, DJI Mavic Mini, DJI Phantom 4, M210 RTK, Matrice 600 Pro, Matrice 300 RTK, and the K3 Mini drone.
  • Figure 3: Examples of multi-modal sensor outputs in UAV-MM3D. Each synchronized frame includes RGB imagery, infrared thermal data, and DVS event streams. LiDAR point clouds are reprojected onto the RGB view for 2D alignment, while millimeter-wave radar measurements are rendered as velocity heatmaps. All modalities are temporally aligned under a unified simulation timestamp.
  • Figure 4: Distribution statistics of UAV-MM3D. (a) Overall UAV–sensor distance distribution. (b) Proportion of UAV types across all simulated scenes. (c) Orientation statistics (roll, pitch, yaw) computed from 3D bounding boxes. (d) Difficulty distribution estimated from UAV size, sensing distance, and weather conditions.
  • Figure 5: Illustration of the eight physically based weather conditions in UAV-MM3D. Each environment is rendered under both day and night illumination, covering clear, rain, fog, and snow settings. These combinations introduce diverse lighting, atmospheric, and visibility variations that enhance the realism and challenge of UAV perception across different urban and suburban scenes.
  • ...and 2 more figures
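
The Figure 3 caption notes that LiDAR point clouds are reprojected onto the RGB view for 2D alignment. A minimal sketch of such a reprojection under a standard pinhole camera model is given below. It is not the authors' pipeline. It assumes the points are already expressed in the camera frame (x right, y down, z forward) and that the intrinsics are derived from the image size and horizontal field of view. The function names `intrinsics_from_fov` and `project_points` and all numeric values are hypothetical, chosen only for illustration.

```python
import math


def intrinsics_from_fov(width, height, fov_deg):
    """Return (fx, fy, cx, cy) for a pinhole camera with square pixels.

    Assumption: the focal length follows from the horizontal field of view,
    and the principal point sits at the image center.
    """
    f = width / (2.0 * math.tan(math.radians(fov_deg) / 2.0))
    return f, f, width / 2.0, height / 2.0


def project_points(points_cam, width, height, fov_deg):
    """Project 3D points given in the camera frame onto the image plane.

    points_cam: iterable of (x, y, z) with x right, y down, z forward.
    Returns a list of (u, v, depth) for points in front of the camera
    that land inside the image bounds.
    """
    fx, fy, cx, cy = intrinsics_from_fov(width, height, fov_deg)
    pixels = []
    for x, y, z in points_cam:
        if z <= 0.0:
            continue  # behind the camera, cannot be seen
        u = fx * x / z + cx
        v = fy * y / z + cy
        if 0.0 <= u < width and 0.0 <= v < height:
            pixels.append((u, v, z))
    return pixels


if __name__ == "__main__":
    # Hypothetical points: one on the optical axis, one behind the camera,
    # and one offset to the right.
    cloud = [(0.0, 0.0, 10.0), (1.0, 0.0, -5.0), (5.0, 0.0, 10.0)]
    for u, v, depth in project_points(cloud, 800, 600, 90.0):
        print(f"pixel=({u:.1f}, {v:.1f}) depth={depth:.1f} m")
```

Keeping the depth alongside each pixel allows the projected points to be colored by range when overlaid on the RGB image. A real pipeline would first apply the LiDAR-to-camera extrinsic transform, which this sketch omits.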