DroneKey++: A Size Prior-free Method and New Benchmark for Drone 3D Pose Estimation from Sequential Images

Seo-Bin Hwang, Yeong-Jun Cho

TL;DR

The paper tackles 3D drone pose estimation without relying on size or mesh priors, and addresses the small scale and limited diversity of existing datasets, which make generalization hard to validate. It introduces DroneKey++, a prior-free end-to-end framework that pairs a keypoint encoder (joint 2D keypoint detection and drone classification) with a ray-based 3D pose decoder to estimate rotation and translation from monocular image sequences. To support evaluation across diverse drones and environments, it also presents 6DroneSyn, a large synthetic benchmark with 52,920 images across 7 models and 88 outdoor backgrounds, generated via 360-degree panorama synthesis to reduce the domain gap. Empirical results show that DroneKey++ delivers state-of-the-art rotation and translation accuracy (e.g., $\text{MAE}_R=17.34^{\circ}$, $\text{MAE}_t=0.135$ m) while maintaining real-time inference ($414.07$ FPS on GPU, $19.25$ FPS on CPU), demonstrating strong generalization and practicality for anti-drone and surveillance applications.
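
As a rough illustration of where ray-based reasoning usually starts, the sketch below back-projects 2D keypoints into unit viewing rays under an ideal pinhole camera model. The function name `keypoints_to_rays`, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the sample pixel values are assumptions for illustration only; this summary does not describe how DroneKey++ actually builds its ray embedding.

```python
import math


def keypoints_to_rays(keypoints_px, fx, fy, cx, cy):
    """Back-project pixel keypoints to unit viewing rays in the camera frame.

    Assumes an ideal pinhole camera with no skew and no lens distortion.
    """
    rays = []
    for u, v in keypoints_px:
        # Direction of the ray through pixel (u, v) at unit depth.
        x, y, z = (u - cx) / fx, (v - cy) / fy, 1.0
        norm = math.sqrt(x * x + y * y + z * z)
        rays.append((x / norm, y / norm, z / norm))
    return rays


# Hypothetical intrinsics for a 640x480 image (not taken from the paper).
rays = keypoints_to_rays(
    [(320.0, 240.0), (400.0, 240.0)],
    fx=800.0, fy=800.0, cx=320.0, cy=240.0,
)
```

A ray fixes only the direction to a keypoint, not its depth, so a method without a size prior has to recover scale from other cues.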

Abstract

Accurate 3D pose estimation of drones is essential for security and surveillance systems. However, existing methods often rely on prior drone information such as physical sizes or 3D meshes. At the same time, current datasets are small-scale, limited to single models, and collected under constrained environments, which makes reliable validation of generalization difficult. We present DroneKey++, a prior-free framework that jointly performs keypoint detection, drone classification, and 3D pose estimation. The framework employs a keypoint encoder for simultaneous keypoint detection and classification, and a pose decoder that estimates 3D pose using ray-based geometric reasoning and class embeddings. To address dataset limitations, we construct 6DroneSyn, a large-scale synthetic benchmark with over 50K images covering 7 drone models and 88 outdoor backgrounds, generated using 360-degree panoramic synthesis. Experiments show that DroneKey++ achieves MAE 17.34 deg and MedAE 17.1 deg for rotation, MAE 0.135 m and MedAE 0.242 m for translation, with inference speeds of 19.25 FPS (CPU) and 414.07 FPS (GPU), demonstrating both strong generalization across drone models and suitability for real-time applications. The dataset is publicly available.
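
To make the two reported statistics concrete, here is a minimal sketch of computing MAE and MedAE from per-frame errors. It assumes that rotation error is the geodesic angle between predicted and ground-truth rotation matrices and that translation error is the Euclidean distance; the abstract does not state the exact definitions, and all function names are hypothetical.

```python
import math
import statistics


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices (nested lists)."""
    # trace(R_pred^T R_gt) equals the sum of elementwise products.
    trace = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # guard against round-off
    return math.degrees(math.acos(cos_angle))


def translation_error_m(t_pred, t_gt):
    """Euclidean distance between two 3D positions, in the units of the inputs."""
    return math.dist(t_pred, t_gt)


def mae_and_medae(errors):
    """Return (mean absolute error, median absolute error) of non-negative errors."""
    return statistics.fmean(errors), statistics.median(errors)


def rot_z(deg):
    """Rotation about the z-axis, used here only to build toy examples."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


identity = rot_z(0.0)
rot_errors = [rotation_error_deg(rot_z(a), identity) for a in (10.0, 20.0, 90.0)]
mae_r, medae_r = mae_and_medae(rot_errors)
t_err = translation_error_m((0.0, 0.0, 5.0), (0.3, 0.4, 5.0))
```

In this toy example the single 90-degree outlier pulls the mean well above the median, which is why reporting both statistics is informative.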

Paper Structure (18 sections, 14 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Overall framework for DroneKey++. The proposed end-to-end pipeline consists of three main components: (a) keypoint encoder, which extracts 2D keypoints and drone class features from the input image sequence; (b) 3D pose decoder, which integrates class embedding and ray embedding to estimate the drone’s 3D rotation and translation; and (c) loss function, which combines encoder and decoder supervision into the total loss. This unified architecture enables simultaneous keypoint detection, class prediction, and accurate 3D pose estimation.
  • Figure 2: 6DroneSyn Dataset Annotations. Examples of annotations: (a) 2D bounding boxes and 2D keypoints, (b) 3D keypoints with translation and rotation vectors, and (c) full 3D trajectories with rotation and translation from sequential images.
  • Figure 3: Qualitative drone 3D pose estimation results of the proposed method. The results show estimated poses across different validation and test scenes and drone types after Gaussian smoothing is applied as post-processing.
  • Figure 4: Comparison of feature distributions across drone image datasets. Dimensionality reduction visualization comparing real-world drone images (●), existing synthetic dataset (▲), and our 360-camera-based synthetic dataset (■) using (a) PCA with the first two principal components and (b) t-SNE with 2D embedding components.
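
The Figure 3 caption mentions Gaussian smoothing as a post-processing step. A minimal sketch of such smoothing, applied along the time axis of a per-frame translation trajectory, is given below. The kernel width `sigma`, the truncation radius, the edge handling, and the function names are all assumptions, since the caption gives no parameters. Rotations are not handled here; they would need a suitable representation before filtering.

```python
import math


def gaussian_smooth(values, sigma=2.0):
    """Smooth a 1D sequence with a truncated Gaussian kernel.

    Near the ends the kernel is clipped to the available samples and
    renormalized, so the output has the same length as the input.
    """
    radius = max(1, int(3 * sigma))
    smoothed = []
    for i in range(len(values)):
        total, weight_sum = 0.0, 0.0
        for j in range(max(0, i - radius), min(len(values), i + radius + 1)):
            w = math.exp(-0.5 * ((j - i) / sigma) ** 2)
            total += w * values[j]
            weight_sum += w
        smoothed.append(total / weight_sum)
    return smoothed


def smooth_trajectory(positions, sigma=2.0):
    """Filter the x, y and z coordinates of per-frame positions independently."""
    axes = [gaussian_smooth(list(axis), sigma) for axis in zip(*positions)]
    return [tuple(p) for p in zip(*axes)]


# Toy trajectory: steady motion in x with alternating +/-0.2 m jitter in z.
noisy = [(0.1 * k, 0.0, 5.0 + (0.2 if k % 2 else -0.2)) for k in range(20)]
smoothed = smooth_trajectory(noisy, sigma=2.0)
```

A larger `sigma` suppresses more frame-to-frame jitter but also lags behind fast maneuvers, so the value would need tuning to the frame rate and the drone's dynamics.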