Generating Diverse Audio-Visual 360 Soundscapes for Sound Event Localization and Detection

Adrian S. Roman, Aiden Chang, Gerardo Meza, Iran R. Roman

TL;DR

This work tackles the scarcity of realistic audio-visual data for sound event localization and detection (SELD) by introducing SELDVisualSynth, a tool that couples SpatialScaper-generated audio with 360° video synthesized over naturalistic backgrounds. The pipeline creates 2,000 first-order Ambisonics (FOA) clips across 14 rooms and aligns the spatiotemporal direction-of-arrival (DoA) annotations with visual tiles, using 50×50 pixel tiles on 1920×960 backgrounds for multi-modal SELD training. Evaluated on STARSS23 with SELDnet-YOLOv8, the approach raises localization recall (LR) to 56.4 and reaches a competitive localization error (LE) of 21.9°, with $ER_{20^{\circ}}=0.62$ and $F_{20^{\circ}}=33.2$, surpassing or rivaling audio-only and augmented baselines. The results indicate that diverse, synchronized audio-visual synthetic data can boost SELD performance without relying on additional augmentation tricks, and the authors provide open-source data and tools to accelerate multimodal SELD research.

Abstract

We present SELDVisualSynth, a tool for generating synthetic videos for audio-visual sound event localization and detection (SELD). Our approach incorporates real-world background images to improve realism in synthetic audio-visual SELD data while also ensuring audio-visual spatial alignment. The tool creates synthetic 360° videos in which objects move in step with the synthetic SELD audio data and its annotations. Experimental results demonstrate that a model trained with this data attains performance gains across multiple metrics, achieving superior localization recall (56.4 LR) and competitive localization error (21.9° LE). We open-source our data generation tool for maximal use by members of the SELD research community.
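
To make the idea of audio-visual spatial alignment concrete, the sketch below shows one way a DoA annotation could be mapped to the position of a 50×50 pixel tile on a 1920×960 frame. It is an illustrative assumption, not the authors' implementation: it assumes an equirectangular projection with azimuth 0° at the horizontal center, positive azimuth to the left, and elevation +90° at the top of the frame. The function names `doa_to_pixel` and `tile_box` are hypothetical.

```python
def doa_to_pixel(azimuth_deg, elevation_deg, width=1920, height=960):
    """Map a DoA in degrees to the (x, y) pixel center on an equirectangular frame.

    Assumed convention (not confirmed by the source): azimuth 0 at the image
    center, positive azimuth to the left, elevation +90 at the top row.
    """
    # Wrap azimuth into [-180, 180) and clamp elevation to [-90, 90].
    az = ((azimuth_deg + 180.0) % 360.0) - 180.0
    el = max(-90.0, min(90.0, elevation_deg))
    x = (0.5 - az / 360.0) * width
    y = (0.5 - el / 180.0) * height
    return x, y


def tile_box(azimuth_deg, elevation_deg, tile=50, width=1920, height=960):
    """Return (left, top, tile, tile) for a square tile centered on the DoA."""
    x, y = doa_to_pixel(azimuth_deg, elevation_deg, width, height)
    # Wrap horizontally (the panorama is periodic) and clamp vertically.
    left = int(round(x - tile / 2)) % width
    top = max(0, min(height - tile, int(round(y - tile / 2))))
    return left, top, tile, tile


if __name__ == "__main__":
    # A source straight ahead lands at the center of the frame.
    print(doa_to_pixel(0, 0))
    print(tile_box(0, 0))
```

Applying such a mapping frame by frame to the audio annotations would keep each visual tile on the same trajectory as its sound event, which is the property the abstract refers to as spatial alignment.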

Paper Structure

This paper contains 9 sections and 2 tables.