Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries

Pengfei Cai, Yan Song, Qing Gu, Nan Jiang, Haoyu Song, Ian McLoughlin

TL;DR

DASM tackles open-vocabulary sound event detection by reframing SED as frame-level retrieval against multi-modal queries generated via CLAP, enabling detection of unseen events. It introduces a dual-stream decoder that separately handles cross-modal event recognition and temporal localization, paired with an inference-time masking strategy to leverage relationships between base and novel classes. The method achieves strong open-vocabulary performance on AudioSet Strong, surpasses CLAP-based approaches, and demonstrates notable cross-dataset generalization on DESED, even exceeding some supervised baselines in zero-shot settings. Collectively, DASM offers a scalable, flexible framework for open-set SED with solid localization accuracy and potential for integration with multimodal LLMs.

Abstract

Most existing sound event detection (SED) algorithms operate under a closed-set assumption, restricting their detection capabilities to predefined classes. While recent efforts have explored language-driven zero-shot SED by exploiting audio-language models, their performance is still far from satisfactory due to the lack of fine-grained alignment and cross-modal feature fusion. In this work, we propose the Detect Any Sound Model (DASM), a query-based framework for open-vocabulary SED guided by multi-modal queries. DASM formulates SED as a frame-level retrieval task, where audio features are matched against query vectors derived from text or audio prompts. To support this formulation, DASM introduces a dual-stream decoder that explicitly decouples event recognition and temporal localization: a cross-modality event decoder performs query-feature fusion and determines the presence of sound events at the clip level, while a context network models temporal dependencies for frame-level localization. Additionally, an inference-time attention masking strategy is proposed to leverage semantic relations between base and novel classes, substantially enhancing generalization to novel classes. Experiments on the AudioSet Strong dataset demonstrate that DASM effectively balances localization accuracy with generalization to novel classes, outperforming CLAP-based methods in the open-vocabulary setting (+7.8 PSDS) and the baseline in the closed-set setting (+6.9 PSDS). Furthermore, in cross-dataset zero-shot evaluation on DESED, DASM achieves a PSDS1 score of 42.2, even exceeding the supervised CRNN baseline. The project page is available at https://cai525.github.io/Transformer4SED/demo_page/DASM/.
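To make the frame-level retrieval formulation concrete, the following is a minimal sketch in which each frame embedding is scored against each query vector. It is a toy stand-in, not DASM's dual-stream decoder: the cosine scoring, the fixed threshold, the function name `frame_level_retrieval`, and the two-dimensional example vectors are all assumptions made for illustration.

```python
import math


def frame_level_retrieval(frames, queries, threshold=0.5):
    """Score every frame against every query and threshold the result.

    frames:  list of T frame embeddings (each a list of D floats)
    queries: dict mapping an event name to a D-dimensional query vector
    Returns {event name: [bool per frame]}.

    Assumption: plain cosine similarity replaces the learned decoder.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    return {
        name: [cosine(frame, query) >= threshold for frame in frames]
        for name, query in queries.items()
    }


# Hypothetical embeddings: three frames, two queries (text- or audio-derived).
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
queries = {"dog_bark": [1.0, 0.0], "siren": [0.0, 1.0]}
activity = frame_level_retrieval(frames, queries)
```

Because the set of queries is just a dictionary supplied at inference time, adding a novel class in this sketch only requires adding one more query vector, which is the property the retrieval view is meant to provide.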

Paper Structure

This paper contains 31 sections, 8 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Comparison of three architectures. (a) A classic closed-set SED model, unable to detect novel classes unseen during training. (b) CLAP architecture, which enables novel class recognition but struggles to produce frame-level predictions. (c) The proposed framework, which detects sound events specified by text or audio queries, enabling novel class detection with frame-level prediction.
  • Figure 2: Detect Any Sound Model (DASM) enables the detection of sound events based on arbitrary text or audio queries. The overall framework, query generation module, and cross-modality event decoder module are presented in (a), (b), and (c), respectively. Note that the channel dimension of features and query vectors is omitted in the figure for simplicity.
  • Figure 3: Two attention masking strategies in self-attention layers of the event decoder during inference. (a) Novel class query vectors cannot attend to base class query vectors. (b) Base class query vectors remain visible to novel class query vectors.
  • Figure 4: Impact of audio query duration. We evaluate DASM on novel classes, starting with audio queries of 5 to 6 minutes and progressively reducing the query duration.
  • Figure 5: Qualitative results of DASM. The model is trained under the AS-partial setting with an HTS-AT backbone and multi-modal queries. The audio samples are randomly selected from the AudioSet evaluation set, and those in bold are novel classes.
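
The two masking strategies described in the Figure 3 caption can be sketched as boolean masks over the query-to-query self-attention matrix. This is a hedged illustration only: the function name `build_query_mask`, the base-first query ordering, and the convention that `True` means "blocked" are assumptions, and the caption does not say how base queries treat novel ones, so the sketch leaves that direction unmasked.

```python
def build_query_mask(num_base, num_novel, novel_sees_base):
    """Return mask[i][j] == True when query i may NOT attend to query j.

    Assumption: queries are ordered base-first, then novel.
    novel_sees_base=False corresponds to strategy (a) in the caption,
    novel_sees_base=True to strategy (b).
    """
    n = num_base + num_novel
    is_novel = [i >= num_base for i in range(n)]
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if is_novel[i] and not is_novel[j]:
                # Novel query attending to a base query:
                # blocked under (a), allowed under (b).
                mask[i][j] = not novel_sees_base
            # Base-to-novel attention is not specified by the caption;
            # it is left unmasked here as a simplification.
    return mask


# Two base queries followed by one novel query (hypothetical sizes).
mask_a = build_query_mask(2, 1, novel_sees_base=False)
mask_b = build_query_mask(2, 1, novel_sees_base=True)
```

In a real attention layer, a mask like this would typically be applied by setting the blocked positions of the attention logits to negative infinity before the softmax.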