Enhanced Sound Event Localization and Detection in Real 360-degree audio-visual soundscapes

Adrian S. Roman, Baladithya Balamurugan, Rithik Pothuganti

TL;DR

The paper tackles sound event localization and detection (SELD) in real 360-degree audio-visual environments, where data scarcity and misalignment between audio and video hinder performance. It extends the audio-only SELDnet23 with a visual branch: detections from object detectors such as YOLO and DETIC are fused with the audio features prior to the GRU. It also introduces audio-visual data augmentation and synthetic data generation, including a 360° AV data generator. Two AV model variants, one YOLO-based and one DETIC-based, both using two multi-head self-attention (MHSA) layers, are evaluated and improve on the baseline's localization and detection metrics; the DETIC variant tailored to STARSS23 achieves the best localization error, while the YOLOv8 variant provides strong overall performance. The findings show practical gains for AV SELD in 360° scenes and are accompanied by an open-source framework to support AV SELD research and deployment.

Abstract

This technical report details our work towards building an enhanced audio-visual sound event localization and detection (SELD) network. We build on top of the audio-only SELDnet23 model and adapt it to be audio-visual by merging both audio and video information prior to the gated recurrent unit (GRU) of the audio-only network. Our model leverages YOLO and DETIC object detectors. We also build a framework that implements audio-visual data augmentation and audio-visual synthetic data generation. We deliver an audio-visual SELDnet system that outperforms the existing audio-visual SELD baseline.
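The fusion step described above, merging audio and video information before the GRU, can be illustrated with a minimal sketch. The report does not specify the merge operator in this excerpt, so per-frame concatenation along the feature axis is an assumption here, as are the function name `fuse_audio_visual` and all tensor sizes; NumPy stands in for whatever deep learning framework the actual system uses.

```python
import numpy as np


def fuse_audio_visual(audio_feats, visual_feats):
    """Concatenate per-frame audio and visual features along the feature axis.

    Hypothetical helper: both inputs are (frames, features) arrays that have
    already been aligned to the same frame rate. Concatenation is an assumed
    merge operator, not necessarily the one used in the report.
    """
    audio_feats = np.asarray(audio_feats, dtype=np.float32)
    visual_feats = np.asarray(visual_feats, dtype=np.float32)
    if audio_feats.ndim != 2 or visual_feats.ndim != 2:
        raise ValueError("expected 2-D arrays of shape (frames, features)")
    if audio_feats.shape[0] != visual_feats.shape[0]:
        raise ValueError("audio and visual streams must share the same number of frames")
    # The fused sequence is what a recurrent layer (e.g. a GRU) would consume.
    return np.concatenate([audio_feats, visual_feats], axis=-1)


# Illustrative sizes only (not taken from the report).
T, F_AUDIO, F_VISUAL = 50, 128, 16
rng = np.random.default_rng(0)
audio = rng.standard_normal((T, F_AUDIO))    # e.g. output of the audio CNN
visual = rng.standard_normal((T, F_VISUAL))  # e.g. encoded detector outputs

fused = fuse_audio_visual(audio, visual)
print(fused.shape)  # (50, 144)
```

The frame-count check reflects the alignment concern raised in the summary: if the two streams are sampled at different rates, one must be resampled before any per-frame merge makes sense.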

Paper Structure (11 sections, 4 figures, 1 table)

Figures (4)

  • Figure 1: Video pixel swapping augmentations.
  • Figure 2: $360^\circ$ audio-visual synthetic frame (top) and its spatialized audio displayed using an acoustic camera (bottom).
  • Figure 3: Enhanced audio-visual SELD system.
  • Figure 4: Per-class localization error performance comparing the audio-visual baseline against our proposed enhancements.