Exploring Text-Queried Sound Event Detection with Audio Source Separation

Han Yin, Jisheng Bai, Yang Xiao, Hui Wang, Siqi Zheng, Yafeng Chen, Rohan Kumar Das, Chong Deng, Jianfeng Chen

TL;DR

This work tackles polyphonic sound event detection by introducing a text-queried SED (TQ-SED) framework that leverages a language-queried audio source separation model, AudioSep-DP, to isolate event-specific audio prior to per-event detection. AudioSep-DP augments a frequency-domain ResUNet with a dual-path recurrent network and a CLAP-based text encoder with FiLM conditioning, trained with an L1 loss on large audio-text datasets. TQ-SED uses multiple lightweight per-event detectors on separated tracks, trained with mean-squared-error loss on soft labels, and achieves state-of-the-art performance on the MAESTRO-Real dataset, while AudioSep-DP attains top separation metrics on DCASE 2024 Task 9. The approach demonstrates substantial improvements in detecting overlapping events and maintains low model complexity, with code and models released for reproducibility and further research.
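
The FiLM conditioning and L1 training objective named in the summary can be illustrated with a minimal sketch. It rests on assumptions: the per-channel scale `gamma` and shift `beta` are passed in directly, whereas in AudioSep-DP they would presumably be derived from the CLAP text embedding. The function names `film` and `l1_loss` are hypothetical and are not taken from the released code.

```python
from typing import List


def film(features: List[List[float]],
         gamma: List[float],
         beta: List[float]) -> List[List[float]]:
    """Feature-wise linear modulation: y[c][t] = gamma[c] * x[c][t] + beta[c].

    `features` is a list of channels, each a list of values.
    `gamma` and `beta` hold one scale and one shift per channel
    (assumed here to come from a text encoder; not computed in this sketch).
    """
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("need one gamma and one beta per channel")
    return [[g * v + b for v in channel]
            for channel, g, b in zip(features, gamma, beta)]


def l1_loss(estimate: List[float], target: List[float]) -> float:
    """Mean absolute error between an estimated and a reference signal."""
    if len(estimate) != len(target) or not estimate:
        raise ValueError("signals must be non-empty and of equal length")
    return sum(abs(e - t) for e, t in zip(estimate, target)) / len(estimate)
```

Because the modulation is applied per channel, the same text-derived pair of parameters scales and shifts every time-frequency position of a feature map.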

Abstract

In sound event detection (SED), overlapping sound events pose a significant challenge, as certain events can be easily masked by background noise or other events, resulting in poor detection performance. To address this issue, we propose the text-queried SED (TQ-SED) framework. Specifically, we first pre-train a language-queried audio source separation (LASS) model to separate the audio tracks corresponding to different events from the input audio. Then, multiple target SED branches are employed to detect individual events. AudioSep is a state-of-the-art LASS model, but it has limitations in extracting dynamic audio information because of its purely convolutional separation structure. To address this, we integrate a dual-path recurrent neural network block into the model. We refer to this structure as AudioSep-DP, which achieved first place in DCASE 2024 Task 9 on language-queried audio source separation (objective single model track). Experimental results show that TQ-SED can significantly improve SED performance, with an improvement of 7.22% in F1 score over the conventional framework. Additionally, we set up comprehensive experiments to explore the impact of model complexity. The source code and pre-trained model are released at https://github.com/apple-yinhan/TQ-SED.
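
The two-stage flow described in the abstract (query the separator once per event, then run that event's own detection branch on the separated track) can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation. The `separate` callable stands in for a pre-trained LASS model, each entry of `detectors` stands in for a target SED branch, and all names and type aliases are hypothetical. The `mse_loss` helper reflects the mean-squared-error objective on soft labels mentioned in the TL;DR.

```python
from typing import Callable, Dict, List, Sequence

Audio = List[float]                        # toy stand-in for a waveform
Separator = Callable[[Audio, str], Audio]  # (mixture, text query) -> track
Detector = Callable[[Audio], List[float]]  # track -> activity scores


def tq_sed(mixture: Audio,
           events: Sequence[str],
           separate: Separator,
           detectors: Dict[str, Detector]) -> Dict[str, List[float]]:
    """Run text-queried separation, then per-event detection.

    For every event label, the label itself is used as the text query
    (an assumption of this sketch), and the separated track is passed
    to the detector dedicated to that event.
    """
    outputs: Dict[str, List[float]] = {}
    for event in events:
        track = separate(mixture, event)         # stage 1: separation
        outputs[event] = detectors[event](track)  # stage 2: detection
    return outputs


def mse_loss(pred: List[float], soft_labels: List[float]) -> float:
    """Mean squared error between predictions and soft labels in [0, 1]."""
    if len(pred) != len(soft_labels) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - s) ** 2 for p, s in zip(pred, soft_labels)) / len(pred)
```

Since every detector sees only the track separated for its own query, each branch only has to make a single-event decision, which is how the framework addresses masking by overlapping events.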

Paper Structure

This paper contains 18 sections, 5 equations, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: Overview of the proposed TQ-SED framework, where we pre-train AudioSep-DP on large-scale audio-text pairs.
  • Figure 2: Per-event F1 scores of the different frameworks on MAESTRO-Real.