Contrastive Loss Based Frame-wise Feature disentanglement for Polyphonic Sound Event Detection
Yadong Guan, Jiqing Han, Hongwei Song, Wenjie Song, Guibin Zheng, Tieran Zheng, Yongjun He
TL;DR
This work tackles polyphonic sound event detection (SED) by addressing entangled frame-wise features that impair discrimination when events overlap. It introduces category-specific projectors to learn category-aligned subspaces and a frame-wise contrastive loss to maximize common information among same-category frames, coupled with a semi-supervised extension that leverages unlabeled data via epoch-weighted pseudo-labels. Ablation studies on the DESED dataset show consistent improvements in PSDS metrics, especially PSDS2 for overlapping events, supporting the disentanglement approach at the cost of only a modest increase in parameters. The method offers a practical route to more accurate end-to-end SED in noisy, real-world environments while reducing labeling requirements through semi-supervised learning.
Abstract
Overlapping sound events are ubiquitous in real-world environments, but existing end-to-end sound event detection (SED) methods still struggle to detect them effectively. A critical reason is that these methods represent overlapping events using shared and entangled frame-wise features, which degrades feature discrimination. To solve this problem, we propose a disentangled feature learning framework that learns category-specific representations. Specifically, we employ different projectors to learn the frame-wise features for each category. To ensure that these features do not contain information from other categories, we maximize the common information between frame-wise features within the same category and propose a frame-wise contrastive loss. In addition, considering that the labeled data available to the proposed method is limited, we propose a semi-supervised frame-wise contrastive loss that can leverage large amounts of unlabeled data to achieve feature disentanglement. The experimental results demonstrate the effectiveness of our method.
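To make the idea of per-category projectors and a frame-wise contrastive objective more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes linear projectors (one weight matrix per category) and swaps in a generic supervised-contrastive (SupCon-style) formulation, which may differ from the loss defined in the paper. The function names `project_per_category` and `framewise_contrastive_loss`, the `temperature` parameter, and all shapes are hypothetical choices made for illustration.

```python
import numpy as np


def project_per_category(frames, weights):
    """Project shared frame features into one subspace per category.

    frames:  (T, D) shared frame-wise features.
    weights: (C, D, P) one hypothetical linear projector per category.
    Returns L2-normalised embeddings of shape (C, T, P).
    """
    z = np.einsum("td,cdp->ctp", frames, weights)
    return z / (np.linalg.norm(z, axis=-1, keepdims=True) + 1e-8)


def framewise_contrastive_loss(z, labels, temperature=0.1):
    """SupCon-style frame-wise loss (an assumed formulation).

    z:      (C, T, P) normalised per-category embeddings.
    labels: (T, C) binary frame-level labels (multi-label, so events may overlap).
    For category c, frames where c is active attract each other in subspace c,
    and are contrasted against all remaining frames.
    """
    num_categories, num_frames, _ = z.shape
    not_self = ~np.eye(num_frames, dtype=bool)
    losses = []
    for c in range(num_categories):
        active = labels[:, c].astype(bool)
        if active.sum() < 2:
            continue  # an anchor needs at least one positive frame
        logits = z[c] @ z[c].T / temperature
        logits = np.where(not_self, logits, -np.inf)  # exclude self-pairs
        # Numerically stable log-softmax over all other frames.
        row_max = logits.max(axis=1, keepdims=True)
        shifted = logits - row_max
        log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        positives = active[:, None] & active[None, :] & not_self
        pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
        per_anchor = -pos_log_prob[active] / positives.sum(axis=1)[active]
        losses.append(per_anchor.mean())
    return float(np.mean(losses)) if losses else 0.0
```

Under these assumptions, the loss is small when frames sharing a category lie close together in that category's subspace and large when they are scattered among frames of other categories, which is the behaviour the abstract describes as maximizing common information within a category.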
