Leveraging LLM and Text-Queried Separation for Noise-Robust Sound Event Detection
Han Yin, Yang Xiao, Jisheng Bai, Rohan Kumar Das
TL;DR
This work tackles noise-robust sound event detection (SED) in realistic environments where the target sounds are unknown and overlap with other sounds. It introduces an LLM-assisted pipeline that fine-tunes a SED model with LLM-guided noise augmentation, then uses the fine-tuned model to generate text queries for a pre-trained language-queried audio source separation (LASS) model, which extracts the target sounds before final SED inference. The approach improves SED performance on the DESED and WildDESED benchmarks, showing the value of LLM-informed data augmentation and text-queried separation for noise-robust SED, and it includes extensive ablations to validate each component. Overall, the study demonstrates a feasible direction for improving SED robustness in real-world acoustics and provides code and pretrained models for future research.
Abstract
Sound Event Detection (SED) is challenging in noisy environments where overlapping sounds obscure target events. Language-queried audio source separation (LASS) aims to isolate the target sound events from a noisy clip. However, this approach can fail when the exact target sound is unknown, particularly in noisy test sets, leading to reduced performance. To address this issue, we leverage the capabilities of large language models (LLMs) to analyze and summarize acoustic data. By using LLMs to identify and select specific noise types, we implement a noise augmentation method for noise-robust fine-tuning. The fine-tuned model then produces clip-wise event predictions, which serve as text queries for the LASS model. Our studies demonstrate that the proposed method improves SED performance in noisy environments. This work represents an early application of LLMs in noise-robust SED and suggests a promising direction for handling overlapping events in SED. Code and pretrained models are available at https://github.com/apple-yinhan/Noise-robust-SED.
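The inference flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 0.5 threshold, the comma-joined query format, and the fixed-SNR mixing rule are all assumptions made for illustration, and the SED and LASS models are passed in as hypothetical callables. The LLM-based selection of noise types is not modeled here; the sketch only shows how a selected noise clip might be mixed in and how clip-wise predictions might become a text query.

```python
import math


def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean`, scaled so the mixture has the requested SNR (dB).

    Assumed augmentation rule for illustration; the paper's exact recipe may differ.
    """
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0.0:
        return list(clean)
    # The gain g satisfies p_clean / (g^2 * p_noise) = 10^(snr_db / 10).
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [x + gain * n for x, n in zip(clean, noise)]


def clip_predictions_to_query(clip_probs, threshold=0.5):
    """Turn clip-wise event probabilities into a text query (assumed format)."""
    active = [label for label, p in sorted(clip_probs.items()) if p >= threshold]
    return ", ".join(label.replace("_", " ").lower() for label in active)


def detect_with_separation(audio, sed_clip_fn, lass_fn, sed_frame_fn, threshold=0.5):
    """Two-stage inference: clip-wise SED -> text query -> LASS -> final SED.

    All three callables are hypothetical stand-ins for real models.
    """
    query = clip_predictions_to_query(sed_clip_fn(audio), threshold)
    # If no event passes the threshold, skip separation and use the raw clip.
    separated = lass_fn(audio, query) if query else audio
    return sed_frame_fn(separated)
```

In this sketch, a clip whose clip-wise probabilities exceed the threshold for the labels "Speech" and "Running_water" would yield the query "running water, speech", which is then handed to the separation model before the final detection pass.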
