
Contrastive Loss Based Frame-wise Feature disentanglement for Polyphonic Sound Event Detection

Yadong Guan, Jiqing Han, Hongwei Song, Wenjie Song, Guibin Zheng, Tieran Zheng, Yongjun He

TL;DR

This work tackles polyphonic SED by addressing entangled frame-wise features that impair discrimination when events overlap. It introduces category-specific projectors to learn category-aligned subspaces and a frame-wise contrastive loss to maximize common information among same-category frames, coupled with a semi-supervised extension that leverages unlabeled data via epoch-weighted pseudo-labels. Ablation studies on the DESED dataset show consistent improvements in PSDS metrics, especially PSDS2 for overlapping events, validating the disentanglement approach with only modest increases in parameters. The method offers a practical route to more accurate end-to-end SED in noisy, real-world environments while reducing labeling requirements through semi-supervised learning.

Abstract

Overlapping sound events are ubiquitous in real-world environments, but existing end-to-end sound event detection (SED) methods still struggle to detect them effectively. A critical reason is that these methods represent overlapping events using shared, entangled frame-wise features, which degrades feature discrimination. To address this problem, we propose a disentangled feature learning framework that learns category-specific representations. Specifically, we employ a different projector for each category to learn its frame-wise features. To ensure that these features do not contain information from other categories, we maximize the common information between frame-wise features within the same category and propose a frame-wise contrastive loss. In addition, because the labeled data available to the proposed method is limited, we propose a semi-supervised frame-wise contrastive loss that can leverage large amounts of unlabeled data to achieve feature disentanglement. The experimental results demonstrate the effectiveness of our method.
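
The two ingredients named in the abstract, per-category projectors and a frame-wise contrastive loss, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The linear projectors, the SupCon-style loss form, the temperature value, and the function names `project_per_category` and `framewise_contrastive_loss` are all assumptions made for illustration. In particular, the sketch assumes that frames where a category is active are positives for each other in that category's subspace, and that every other frame acts as a negative.

```python
import numpy as np


def project_per_category(frames, weights):
    """Apply one (hypothetical) linear projector per category.

    frames:  (T, D) shared frame-wise features from an encoder.
    weights: (C, D, P) one projection matrix per category.
    Returns: (C, T, P) L2-normalised category-specific features.
    """
    z = np.einsum("td,cdp->ctp", frames, weights)
    norm = np.linalg.norm(z, axis=-1, keepdims=True)
    return z / np.maximum(norm, 1e-12)


def framewise_contrastive_loss(z, labels, temperature=0.1):
    """SupCon-style frame-wise loss, averaged over all valid anchors.

    z:      (C, T, P) unit-norm features, T >= 2.
    labels: (T, C) binary frame-level labels (1 = category active).
    Assumption: in category c's subspace, two frames are a positive
    pair when category c is active in both, whatever else overlaps.
    """
    num_categories, num_frames, _ = z.shape
    not_self = ~np.eye(num_frames, dtype=bool)
    total, count = 0.0, 0
    for c in range(num_categories):
        sim = z[c] @ z[c].T / temperature          # (T, T) similarities
        active = labels[:, c].astype(bool)
        pos = np.outer(active, active) & not_self  # positive-pair mask
        # Log-softmax over all other frames (self-similarity excluded).
        masked = np.where(not_self, sim, -np.inf)
        m = masked.max(axis=1, keepdims=True)
        log_denom = m + np.log(np.exp(masked - m).sum(axis=1, keepdims=True))
        log_prob = sim - log_denom
        n_pos = pos.sum(axis=1)
        anchors = n_pos > 0                        # frames with >= 1 positive
        if anchors.any():
            pos_log_prob = np.where(pos, log_prob, 0.0).sum(axis=1)
            total += float(-(pos_log_prob[anchors] / n_pos[anchors]).sum())
            count += int(anchors.sum())
    return total / max(count, 1)
```

Minimising a loss of this shape pulls together frames that share a category within that category's subspace, including frames where other events overlap, which is the intuition the abstract gives for removing information about other categories. The semi-supervised variant is not sketched here because the source does not specify how pseudo-labels are weighted.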

Paper Structure (9 sections, 8 equations, 3 figures, 2 tables, 1 algorithm)

This paper contains 9 sections, 8 equations, 3 figures, 2 tables, 1 algorithm.

Figures (3)

  • Figure 1: Diagram of our approach. The circles in the dashed box are three features derived from the "Speaking" projector. They all contain "Speaking". Two of them also contain other events. Maximizing the mutual information of pairwise features will reduce irrelevant information in overlapping events, as shown on the right side.
  • Figure 2: Overall architecture of the proposed method. "Proj" denotes a category-specific projector.
  • Figure 3: Class-wise feature visualization of the baseline and our method. Blue and pink colors indicate that the ground-truth label is "1" and "0", respectively.