HAVT-IVD: Heterogeneity-Aware Cross-Modal Network for Audio-Visual Surveillance: Idling Vehicles Detection With Multichannel Audio and Multiscale Visual Cues

Xiwen Li, Xiaoya Tang, Tolga Tasdizen

TL;DR

This work tackles idling vehicle detection (IVD) using multichannel audio and video in surveillance settings. It introduces HAVT-IVD, a heterogeneity-aware audio-visual transformer with a visual feature pyramid and decoupled detection heads, designed to address modality misalignment, large variation in bounding-box scale, and training instability. Key contributions include global audio-visual routing via self-attention, SPCA-driven AVCE fusion, multiscale feature integration, and per-scale decoupled heads, achieving state-of-the-art mAP on AVIVD and strong generalization to MAVD. The results point to practical value for robust IVD in complex driving environments and suggest that the approach extends to other cross-modal vehicle detection tasks.

Abstract

Idling vehicle detection (IVD) uses surveillance video and multichannel audio to localize and classify vehicles in the last frame as moving, idling, or engine-off in pick-up zones. IVD faces three challenges: (i) modality heterogeneity between visual cues and audio patterns; (ii) large box scale variation requiring multi-resolution detection; and (iii) training instability due to coupled detection heads. The previous end-to-end (E2E) model with simple CBAM-based bi-modal attention fails to handle these issues and often misses vehicles. We propose HAVT-IVD, a heterogeneity-aware network with a visual feature pyramid and decoupled heads. Experiments show HAVT-IVD improves mAP by 7.66 over the disjoint baseline and 9.42 over the E2E baseline.

Paper Structure

This paper contains 10 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: HAVT-IVD architecture. Shapes not to scale.
  • Figure 2: Illustration of Instance Heterogeneity