AV-DTEC: Self-Supervised Audio-Visual Fusion for Drone Trajectory Estimation and Classification

Zhenyuan Xiao, Yizhuo Yang, Guili Xu, Xianglong Zeng, Shenghai Yuan

TL;DR

AV-DTEC tackles robust anti-UAV detection under variable lighting with a self-supervised audio-visual fusion framework. It introduces Audio Mamba and Vision Mamba (AVMamba) backed by a selective state-space model, a plug-and-play Feature Fusion Neck with a Residual Cross-Attention FEM, and an Adaptive Adjustment Mechanism (AAM) that uses a teacher-student alignment to balance modalities. Pseudo-labels derived from LiDAR via DBSCAN enable self-supervised training, yielding accurate trajectory estimation and UAV classification without manual annotations. Experiments on the MMAUD dataset show state-of-the-art performance with favorable efficiency, and ablations demonstrate the critical roles of TMamba, FEM, and AAM for robust, cross-lighting fusion. The work is open-sourced, promoting practical deployment of lightweight, multi-modal anti-UAV systems.

Abstract

The increasing use of compact UAVs has created significant threats to public safety, while traditional drone detection systems are often bulky and costly. To address these challenges, we propose AV-DTEC, a lightweight, self-supervised anti-UAV system based on audio-visual fusion. AV-DTEC is trained using self-supervised learning with labels generated by LiDAR, and it learns audio and visual features simultaneously through a parallel selective state-space model. With the learned features, a specially designed plug-and-play primary-auxiliary feature enhancement module integrates visual features into audio features for better robustness in cross-lighting conditions. To reduce reliance on auxiliary features and align modalities, we propose a teacher-student model that adaptively adjusts the weighting of visual features. AV-DTEC demonstrates high accuracy and effectiveness on real-world multi-modal data. The code and trained models are publicly accessible on GitHub: https://github.com/AmazingDay1/AV-DETC
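
The TL;DR states that pseudo-labels are derived from LiDAR via DBSCAN. As a rough illustration of that idea (not the authors' pipeline), the sketch below clusters a small 3D point set with a minimal pure-Python DBSCAN and takes the centroid of the largest cluster as a position pseudo-label. The function names `dbscan` and `pseudo_label`, the values `eps=0.5` and `min_samples=4`, and the "largest cluster is the UAV" rule are all assumptions made for this example; the paper's actual preprocessing and cluster selection may differ.

```python
import math
from collections import deque


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Brute-force neighbour lists; each point counts as its own neighbour.
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not visited yet"
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        labels[i] = cluster  # i is a core point: start a new cluster
        queue = deque(neighbours[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue  # already assigned, do not expand again
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is also core: keep growing
        cluster += 1
    return labels


def pseudo_label(points, eps=0.5, min_samples=4):
    """Return the centroid of the largest cluster, or None if all is noise.

    Assumption for this sketch: the largest dense cluster is the UAV.
    """
    labels = dbscan(points, eps, min_samples)
    cluster_ids = {label for label in labels if label >= 0}
    if not cluster_ids:
        return None
    best = max(cluster_ids, key=labels.count)
    members = [p for p, label in zip(points, labels) if label == best]
    return tuple(sum(axis) / len(members) for axis in zip(*members))
```

Calling `pseudo_label` on one LiDAR frame (a list of `(x, y, z)` tuples) gives a single 3D position, or `None` when no dense cluster is found, in which case that frame would presumably carry no position label. A production version would more likely use an optimized implementation such as scikit-learn's `DBSCAN` rather than the quadratic neighbour search shown here.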

Paper Structure

This paper contains 19 sections, 7 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Our audio-visual fusion model effectively identifies and locates drug-smuggling UAVs with high robustness and cost efficiency.
  • Figure 2: AV-DTEC architecture. During training, the learnable visual token extracted by Vim is trained through the teacher-student model to output the UAV center position and existence probability. During inference, the token outputs only the existence probability, which is used to adjust the proportion of visual features.
  • Figure 3: The architecture of the TMamba and SMamba blocks.
  • Figure 4: Feature Enhancement Module.
  • Figure 5: The $\overline{\text{Acc}}$ confusion matrix for the classification results of AV-DTEC.
  • ...and 2 more figures
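
The Figure 2 caption says that, at inference, an existence probability adjusts the proportion of visual features. One plausible reading, sketched below under stated assumptions, is a residual gate: audio features act as the primary signal and visual features are added in proportion to that probability. The function name `gate_visual`, the flat-list feature representation, and the additive form are hypothetical choices for illustration; the paper's Adaptive Adjustment Mechanism may combine the modalities differently.

```python
def gate_visual(audio_feat, visual_feat, p_exist):
    """Add visual features to audio features, scaled by p_exist in [0, 1].

    Hypothetical gating rule: fused = audio + p_exist * visual.
    With p_exist = 0 (e.g. the UAV is not visible in the dark), the
    output falls back to the audio features alone.
    """
    if not 0.0 <= p_exist <= 1.0:
        raise ValueError("p_exist must lie in [0, 1]")
    if len(audio_feat) != len(visual_feat):
        raise ValueError("feature vectors must have the same length")
    return [a + p_exist * v for a, v in zip(audio_feat, visual_feat)]
```

Under this reading, a low existence probability in poor lighting suppresses unreliable visual input without changing the audio pathway, which matches the primary-auxiliary framing in the abstract.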