C2FDrone: Coarse-to-Fine Drone-to-Drone Detection using Vision Transformer Networks
Sairam VC Rebbapragada, Pranoy Panda, Vineeth N Balasubramanian
TL;DR
This work tackles drone-to-drone detection, where targets are extremely small and often distorted, under a real-time constraint. It introduces a coarse-to-fine pipeline built on vision transformers: an Object Enhancement Net produces an objectness mask, and a DETR-based fine detector (DAB-DETR) with 4D queries is primed by the coarse detections, including temporal cues. Training uses a dedicated objectness loss $L_{OE}$, a decoder-query alignment loss, and a query-size constraint. The method improves on the state-of-the-art F1 score by $+7\%$, $+3\%$, and $+1\%$ on FL-Drones, AOT, and NPS-Drones, respectively, runs at about $31$ FPS on a Jetson Xavier NX for $640$-pixel frames, and has a very low rate of false positives per image (FPPI, $3.2\times 10^{-4}$). These results indicate practical viability for edge-deployed, real-time drone safety, counter-drone, and search-and-rescue applications; the code is to be released.
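To make the idea of priming the fine detector concrete, here is a minimal sketch of how coarse detections could be turned into 4D anchor queries of the form (cx, cy, w, h), the box parameterization that DAB-DETR uses for its queries. This is an illustration under stated assumptions, not the authors' implementation: the function name `prime_queries`, the top-k selection by score, and the default padding anchor are all hypothetical.

```python
from typing import List, Tuple

# Pixel-space box: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]
# Normalized anchor query: (cx, cy, w, h), each in [0, 1].
Query = Tuple[float, float, float, float]


def prime_queries(
    coarse_boxes: List[Box],
    scores: List[float],
    img_w: int,
    img_h: int,
    num_queries: int = 4,
) -> List[Query]:
    """Build 4D anchor queries from the highest-scoring coarse boxes.

    Hypothetical helper: the selection rule and the padding anchor are
    assumptions made for illustration only.
    """
    # Rank coarse detections by confidence, highest first.
    ranked = sorted(zip(scores, coarse_boxes), key=lambda p: p[0], reverse=True)

    queries: List[Query] = []
    for _, (x0, y0, x1, y1) in ranked[:num_queries]:
        # Convert corner format to normalized center/size format.
        cx = (x0 + x1) / 2 / img_w
        cy = (y0 + y1) / 2 / img_h
        w = (x1 - x0) / img_w
        h = (y1 - y0) / img_h
        queries.append((cx, cy, w, h))

    # Pad with a small centered anchor if there are too few coarse boxes
    # (assumed default; a real system would likely use learned anchors).
    while len(queries) < num_queries:
        queries.append((0.5, 0.5, 0.1, 0.1))
    return queries
```

For example, `prime_queries([(0, 0, 64, 64), (320, 320, 640, 640)], [0.2, 0.9], 640, 640, num_queries=3)` places the higher-scoring box first and fills the third slot with the default anchor.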
Abstract
A vision-based drone-to-drone detection system is crucial for various applications like collision avoidance, countering hostile drones, and search-and-rescue operations. However, detecting drones presents unique challenges, including small object sizes, distortion, occlusion, and real-time processing requirements. Current methods integrating multi-scale feature fusion and temporal information have limitations in handling extreme blur and minuscule objects. To address this, we propose a novel coarse-to-fine detection strategy based on vision transformers. We evaluate our approach on three challenging drone-to-drone detection datasets, achieving F1 score enhancements of 7%, 3%, and 1% on the FL-Drones, AOT, and NPS-Drones datasets, respectively. Additionally, we demonstrate real-time processing capabilities by deploying our model on an edge-computing device. Our code will be made publicly available.
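For readers unfamiliar with the reported metrics, the sketch below computes the F1 score and the number of false positives per image (FPPI) from raw detection counts using their standard definitions. The function names are illustrative, and the rule for deciding whether a detection counts as a true positive (for example, an IoU threshold) is not specified here and is assumed to have been applied beforehand.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def fppi(fp: int, num_images: int) -> float:
    """False positives per image, averaged over the evaluated frames."""
    if num_images <= 0:
        raise ValueError("num_images must be positive")
    return fp / num_images
```

With these definitions, a detector that produces 32 false positives over 100,000 frames has an FPPI of $3.2\times 10^{-4}$; the counts are made up purely to show the scale of that figure.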
