Enhanced Sound Event Localization and Detection in Real 360-degree audio-visual soundscapes
Adrian S. Roman, Baladithya Balamurugan, Rithik Pothuganti
TL;DR
The paper tackles sound event localization and detection (SELD) in real 360-degree audio-visual environments, where data scarcity and misalignment between audio and video hinder performance. It extends the audio-only SELDnet23 with a visual branch that fuses detections from object detectors such as YOLO and DETIC with audio features prior to the GRU, and it introduces audio-visual data augmentation and synthetic data generation, including a 360° AV data generator. Two AV model variants, one YOLO-based and one DETIC-based, each with two multi-head self-attention (MHSA) layers, are evaluated and improve on the baselines in localization and detection metrics; the DETIC variant tailored to STARSS23 achieves the lowest localization error, while the YOLOv8 variant provides strong overall performance. The findings show practical gains for AV SELD in 360° scenes and are accompanied by an open-source framework to support AV SELD research and deployment.
Abstract
This technical report details our work towards building an enhanced audio-visual sound event localization and detection (SELD) network. We build on top of the audio-only SELDnet23 model and adapt it to be audio-visual by merging both audio and video information prior to the gated recurrent unit (GRU) of the audio-only network. Our model leverages YOLO and DETIC object detectors. We also build a framework that implements audio-visual data augmentation and audio-visual synthetic data generation. We deliver an audio-visual SELDnet system that outperforms the existing audio-visual SELD baseline.
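The fusion step described above, merging audio and video information before the recurrent layer, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the bounding-box encoding, the `max_objects` limit, and the feature sizes are all assumptions chosen for illustration, and NumPy stands in for the actual network code.

```python
import numpy as np

def encode_detections(boxes_per_frame, max_objects=6):
    """Pack per-frame bounding boxes into a fixed-size vector (hypothetical encoding).

    boxes_per_frame: one list per frame of (x_center, y_center, width, height)
    tuples, normalized to [0, 1]. Unused object slots are left at zero.
    Returns an array of shape (num_frames, max_objects * 4).
    """
    out = np.zeros((len(boxes_per_frame), max_objects * 4), dtype=np.float32)
    for t, boxes in enumerate(boxes_per_frame):
        for k, box in enumerate(boxes[:max_objects]):
            out[t, 4 * k: 4 * k + 4] = box
    return out

def fuse_before_recurrent(audio_feats, visual_feats):
    """Concatenate audio and visual features along the feature axis.

    Both inputs are (num_frames, feature_dim) arrays that must already be
    aligned in time; the result would be the input to the recurrent layer.
    """
    if audio_feats.shape[0] != visual_feats.shape[0]:
        raise ValueError("audio and visual streams must have the same number of frames")
    return np.concatenate([audio_feats, visual_feats], axis=1)

# Example: 50 frames of 128-dim audio features, one detected object per frame.
audio = np.zeros((50, 128), dtype=np.float32)
boxes = [[(0.5, 0.5, 0.1, 0.2)] for _ in range(50)]
fused = fuse_before_recurrent(audio, encode_detections(boxes))
print(fused.shape)
```

The explicit frame-count check reflects the alignment requirement between the two streams: concatenation only makes sense when each audio frame is paired with the detections from the same time step.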
