C2FDrone: Coarse-to-Fine Drone-to-Drone Detection using Vision Transformer Networks

Sairam VC Rebbapragada; Pranoy Panda; Vineeth N Balasubramanian

TL;DR

This work addresses drone-to-drone detection, where targets are extremely small and often distorted, under real-time constraints. It introduces a coarse-to-fine pipeline built on vision transformers: an Object Enhancement Net produces an objectness mask, and a DETR-based fine detector with 4D queries (DAB-DETR) is primed by the resulting coarse detections, including temporal cues. Training uses a dedicated objectness loss $L_{OE}$, a decoder-query alignment loss, and a query-size constraint. The method improves F1 over the prior state of the art by $+7\%$, $+3\%$, and $+1\%$ on FL-Drones, AOT, and NPS-Drones, respectively, runs at about $31$ FPS on a Jetson Xavier NX for $640$-pixel frames, and reports a very low false-positives-per-image rate (FPPI, $3.2\times 10^{-4}$). These results indicate practical viability for edge-enabled, real-time drone safety, counter-drone, and search-and-rescue applications, with code to be released.
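For readers unfamiliar with the two headline metrics, the following is a minimal sketch of how F1 and FPPI are conventionally computed from detection counts. The function names and the example counts are hypothetical and chosen only for illustration; they are not taken from the paper, and the paper's exact matching protocol (for example, the IoU threshold used to decide a true positive) is not specified here.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from raw detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def fppi(fp: int, num_images: int) -> float:
    """False positives per image: total false positives over frames evaluated."""
    if num_images <= 0:
        raise ValueError("num_images must be positive")
    return fp / num_images


# Illustrative counts only (not results from the paper).
example_f1 = f1_score(tp=8, fp=2, fn=2)
example_fppi = fppi(fp=32, num_images=100_000)
```

With these made-up counts, precision and recall are both 0.8, so F1 is 0.8, and 32 false positives over 100,000 frames corresponds to an FPPI of $3.2\times 10^{-4}$.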

Abstract

A vision-based drone-to-drone detection system is crucial for various applications like collision avoidance, countering hostile drones, and search-and-rescue operations. However, detecting drones presents unique challenges, including small object sizes, distortion, occlusion, and real-time processing requirements. Current methods integrating multi-scale feature fusion and temporal information have limitations in handling extreme blur and minuscule objects. To address this, we propose a novel coarse-to-fine detection strategy based on vision transformers. We evaluate our approach on three challenging drone-to-drone detection datasets, achieving F1 score enhancements of 7%, 3%, and 1% on the FL-Drones, AOT, and NPS-Drones datasets, respectively. Additionally, we demonstrate real-time processing capabilities by deploying our model on an edge-computing device. Our code will be made publicly available.

Paper Structure (19 sections, 3 equations, 4 figures, 5 tables)

INTRODUCTION
RELATED WORK
Drone Detection
DETR Models for Object Detection
METHODOLOGY
Coarse Level: Objectness Mask
Fine-grained Level: Drone Localization
Loss Functions
EXPERIMENTS AND RESULTS
Datasets
Implementation Details
Evaluation metrics
Comparison with Existing Works
Ablation Studies
Qualitative Results
...and 4 more sections

Figures (4)

Figure 1: A challenging frame from the NPS-Drones dataset [li2016multi]. Green boxes: ground truth; red boxes: model predictions. (a) Traditional methods uniformly scan the entire frame for drones, leading to wasted effort and missed detections in complex scenarios. (b) Our method precisely localizes drones using a coarse-to-fine detection approach. (c) The coarse level narrows down the search space by generating an objectness mask. (d) The fine-grained level focuses on the refined search space, enhancing drone detection.
Figure 2: Our coarse-to-fine detection approach. We process video frames with the Swin Transformer [liu2021swin] followed by an FPN [lin2017feature] to obtain multi-scale features, which are input to DAB-DETR [liu2022dabdetr]. The Object Enhancement Network (OEN) refines the features from Swin layers 1, 2, and 3 by enhancing foreground details and reducing background noise. By computing the mean of the enhanced feature map and applying a threshold, we obtain coarse detection results that highlight the regions likely to contain objects. We use these regions to prime the DAB-DETR decoder, significantly reducing the search space and improving localization performance. Green boxes: ground truth; red boxes: model predictions. $L_{cls}$ and $L_{reg}$ are the classification and regression losses, respectively, commonly used with DETR-family models [liu2022dabdetr].
Figure 3: Comparison of precision-recall curves between TransVisDrone [sangam2023transvisdrone] and our method.
Figure 4: Qualitative analysis. We use coarse-level localization information of drones to guide the DAB-DETR decoder queries (Equation \ref{loss:dec_query}). Traditional FPN features contain severe noise (column b), which is mitigated by our proposed Object Enhancement Network (column c), leading to accurate drone detections in challenging scenarios (column d). Green boxes: ground truth; blue boxes: baseline predictions; red boxes: our model's predictions.
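The coarse step described in the Figure 2 caption (take the mean of the enhanced feature map, then threshold it) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the min-max scaling before thresholding, the threshold value of 0.5, and the reduction of the mask to a single normalized (cx, cy, w, h) box are all assumptions made here for a self-contained example.

```python
import numpy as np


def coarse_objectness_mask(features: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Turn a (C, H, W) enhanced feature map into a boolean (H, W) mask.

    The channel-wise mean follows the caption; the min-max scaling and
    the default threshold are assumptions for this sketch.
    """
    mean_map = features.mean(axis=0)
    lo, hi = mean_map.min(), mean_map.max()
    if hi > lo:
        scaled = (mean_map - lo) / (hi - lo)
    else:
        # A constant map carries no foreground evidence.
        scaled = np.zeros_like(mean_map, dtype=float)
    return scaled > threshold


def mask_to_box(mask: np.ndarray):
    """Return one normalized (cx, cy, w, h) box enclosing all True cells.

    Returns None for an empty mask. Using a single enclosing box is a
    simplification; a real system would likely separate regions.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    height, width = mask.shape
    x0, x1 = int(xs.min()), int(xs.max()) + 1
    y0, y1 = int(ys.min()), int(ys.max()) + 1
    return (
        (x0 + x1) / 2 / width,
        (y0 + y1) / 2 / height,
        (x1 - x0) / width,
        (y1 - y0) / height,
    )


# Toy feature map: a small bright patch on an otherwise empty background.
toy_features = np.zeros((4, 8, 8))
toy_features[:, 2:4, 5:7] = 1.0
toy_mask = coarse_objectness_mask(toy_features)
toy_box = mask_to_box(toy_mask)
```

A four-value box of this form is the kind of quantity that could serve as a prior for 4D decoder queries; how the paper actually maps coarse regions to queries is governed by its decoder-query loss and is not reproduced here.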
