Shared Representation Learning for Reference-Guided Targeted Sound Detection

Shubham Gupta, Adarsh Arigala, B. R. Dilleswari, Sri Rama Murty Kodukula

Abstract

Human listeners exhibit the remarkable ability to segregate a desired sound from complex acoustic scenes through selective auditory attention, motivating the study of Targeted Sound Detection (TSD). The task requires detecting and localizing a target sound in a mixture when a reference audio of that sound is provided. Prior approaches rely on generating a sound-discriminative conditional embedding vector for the reference and pairing it with a separate mixture encoder, with both components jointly optimized through multi-task learning. In this work, we propose a unified encoder architecture that processes both the reference and the mixture audio within a shared representation space, promoting stronger alignment while reducing architectural complexity. This design choice not only simplifies the overall framework but also enhances generalization to unseen classes. Following the multi-task training paradigm, our method substantially improves over prior approaches and establishes a new state of the art for targeted sound detection, with a segment-level F1 score of 83.15% and an overall accuracy of 95.17% on the URBAN-SED dataset.
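For readers unfamiliar with the headline metric, the following is a minimal sketch of how a segment-level F1 score is commonly computed for a single target class. It is not the authors' evaluation code: the function names `to_segments` and `segment_f1`, the 1-second segment length, and the toy event times are all assumptions made for illustration.

```python
import math
from typing import List, Tuple

Event = Tuple[float, float]  # (onset_seconds, offset_seconds)


def to_segments(events: List[Event], clip_len: float, seg_len: float = 1.0) -> List[bool]:
    """Mark each fixed-length segment as active if any event overlaps it."""
    n_segments = int(math.ceil(clip_len / seg_len))
    active = [False] * n_segments
    for onset, offset in events:
        for i in range(n_segments):
            seg_start = i * seg_len
            seg_end = seg_start + seg_len
            # Strict inequalities: an event that only touches a boundary does not count.
            if onset < seg_end and offset > seg_start:
                active[i] = True
    return active


def segment_f1(ref_events: List[Event], pred_events: List[Event],
               clip_len: float, seg_len: float = 1.0) -> float:
    """Segment-level F1 for one target class in one clip (illustrative only)."""
    ref = to_segments(ref_events, clip_len, seg_len)
    pred = to_segments(pred_events, clip_len, seg_len)
    tp = sum(r and p for r, p in zip(ref, pred))
    fp = sum((not r) and p for r, p in zip(ref, pred))
    fn = sum(r and (not p) for r, p in zip(ref, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example (assumed values): the prediction covers the event but runs too long.
f1 = segment_f1(ref_events=[(1.0, 3.0)], pred_events=[(1.5, 4.2)], clip_len=10.0)
print(round(f1, 4))
```

In this toy case the reference occupies two segments and the prediction four, so two true positives and two false positives give an F1 of 2/3. A full evaluation would typically accumulate the counts over all clips before computing the score.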

Paper Structure

This paper contains 11 sections, 1 equation, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overview of the proposed method. The tagging module comprises two fully connected layers followed by a softmax layer for event prediction.
  • Figure 2: Comparison of per-class F1 scores on URBAN-TSD-Strong, reproduced under the same evaluation protocol.
  • Figure 3: Temporal localization example depicting waveform with ground-truth and predicted event boundaries, together with the model’s frame-level confidence scores.
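The relationship between frame-level confidence scores and predicted event boundaries, as depicted in Figure 3, can be illustrated with a simple thresholding sketch. This is an assumption about a typical post-processing step, not a description of the paper's method: the function name `scores_to_events`, the 0.5 threshold, and the frame hop are hypothetical.

```python
from typing import List, Tuple


def scores_to_events(scores: List[float], threshold: float = 0.5,
                     hop_s: float = 0.02) -> List[Tuple[float, float]]:
    """Convert per-frame confidence scores into (onset, offset) pairs in seconds.

    A frame is treated as active when its score reaches `threshold`;
    consecutive active frames are merged into one event.
    """
    events = []
    start = None  # index of the first frame of the currently open event
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            events.append((start * hop_s, i * hop_s))
            start = None
    if start is not None:
        # The event is still active at the end of the clip.
        events.append((start * hop_s, len(scores) * hop_s))
    return events


# Toy scores (assumed values) with a 1-second hop for readability.
print(scores_to_events([0.1, 0.8, 0.9, 0.2, 0.7], threshold=0.5, hop_s=1.0))
```

Practical systems often add smoothing, such as median filtering, before thresholding; that step is omitted here to keep the sketch short.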