Prototype based Masked Audio Model for Self-Supervised Learning of Sound Event Detection

Pengfei Cai, Yan Song, Nan Jiang, Qing Gu, Ian McLoughlin

TL;DR

PMAM addresses the data-labeling bottleneck in sound event detection by learning self-supervised representations through a Gaussian Mixture Model–based prototypical distribution that generates frame-level pseudo labels for a masked audio model. The approach combines a dual-branch encoder (CNN and Transformer PaSST) with a context transformer, and uses an EM-like iterative refinement of pseudo labels, followed by mean-teacher semi-supervised fine-tuning on a small labeled set. A prototype-wise binary cross-entropy loss enables independent multi-prototype supervision, improving handling of polyphonic events. On DESED, PMAM achieves state-of-the-art PSDS1 scores (up to 62.5%), demonstrating the practical impact of unsupervised learning for fine-grained SED with limited labeled data.
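As a rough illustration of how a GMM-based prototypical distribution could turn a frame embedding into a frame-level pseudo label, the sketch below computes the posterior of each Gaussian component (treated here as one prototype) and marks every prototype whose posterior passes a threshold. The diagonal covariances, the thresholding rule, the function names, and all numeric values are assumptions made for illustration; the summary above does not specify how the posteriors are converted into labels.

```python
import math

def diag_gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def prototype_posteriors(x, weights, means, variances):
    """Posterior probability of each GMM component (prototype) given frame x."""
    log_joint = [
        math.log(w) + diag_gaussian_logpdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(log_joint)  # subtract the max before exponentiating for stability
    unnorm = [math.exp(lj - peak) for lj in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def multi_hot_pseudo_label(posteriors, threshold=0.2):
    """Hypothetical rule: activate every prototype whose posterior >= threshold."""
    return [1 if p >= threshold else 0 for p in posteriors]

# Toy GMM with three prototypes in a 2-D embedding space (made-up values).
weights = [0.5, 0.3, 0.2]
means = [[0.0, 0.0], [2.0, 2.0], [-3.0, 1.0]]
variances = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]

frame = [1.0, 1.0]  # equidistant from prototypes 0 and 1, far from prototype 2
post = prototype_posteriors(frame, weights, means, variances)
label = multi_hot_pseudo_label(post)
print([round(p, 4) for p in post], label)
```

In this toy setup the frame sits between two prototypes, so both are activated in the pseudo label. This is the kind of multi-prototype target that a per-prototype loss can supervise directly.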

Abstract

A significant challenge in sound event detection (SED) is the effective utilization of unlabeled data, given the limited availability of labeled data due to high annotation costs. Semi-supervised algorithms rely on labeled data to learn from unlabeled data, and the performance is constrained by the quality and size of the former. In this paper, we introduce the Prototype based Masked Audio Model (PMAM) algorithm for self-supervised representation learning in SED, to better exploit unlabeled data. Specifically, semantically rich frame-level pseudo labels are constructed from a Gaussian mixture model (GMM) based prototypical distribution modeling. These pseudo labels supervise the learning of a Transformer-based masked audio model, in which binary cross-entropy loss is employed instead of the widely used InfoNCE loss, to provide independent loss contributions from different prototypes, which is important in real scenarios in which multiple labels may apply to unsupervised data frames. A final stage of fine-tuning with just a small amount of labeled data yields a very high performing SED model. On like-for-like tests using the DESED task, our method achieves a PSDS1 score of 62.5%, surpassing current state-of-the-art models and demonstrating the superiority of the proposed technique.
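To make the contrast between the two losses concrete, here is a minimal sketch for a single masked frame. It compares a prototype-wise binary cross-entropy with a single-positive softmax cross-entropy, used as a simplified stand-in for an InfoNCE-style objective. The function names, logits, and targets are hypothetical and are not taken from the paper's implementation.

```python
import math

def prototype_bce(logits, targets):
    """Mean binary cross-entropy over prototypes, computed from raw logits.

    Uses the numerically stable form max(z, 0) - z*t + log(1 + exp(-|z|)),
    so each prototype contributes an independent term.
    """
    total = 0.0
    for z, t in zip(logits, targets):
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

def softmax_ce(logits, target_index):
    """Cross-entropy against one positive; the other logits act as negatives."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target_index]

# Hypothetical frame in which prototypes 0 and 1 are both active.
logits = [4.0, 4.0, -4.0]
targets = [1, 1, 0]

bce = prototype_bce(logits, targets)
ce = softmax_ce(logits, target_index=0)
print(f"prototype-wise BCE: {bce:.4f}, single-positive softmax CE: {ce:.4f}")
```

With two equally active prototypes, the softmax loss cannot drop below log 2 because the positives compete for the same probability mass. The per-prototype BCE can approach zero, which is one way to read the abstract's point about independent loss contributions.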

Paper Structure (16 sections, 4 equations, 3 figures, 2 tables)

This paper contains 16 sections, 4 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: The proposed self-supervised iterative PMAM framework. The E-step (bottom) extracts frame-level pseudo labels from latent embeddings using the prototypical distribution modeling module, to train the masked audio model, which predicts pseudo labels of the masked frames during the M-step (top).
  • Figure 2: The point-biserial correlation coefficient matrix between prototype-based pseudo labels in the second iteration and real labels. 'None' denotes frames in which no event occurred. The pseudo labels are reordered to match the sequence of the real labels, to reveal the correlation more clearly.
  • Figure 3: Pseudo labels and ground truth corresponding to audio samples in the second iteration.
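
The point-biserial coefficient named in the Figure 2 caption can be sketched as follows for one (real label, prototype) pair. This assumes the real label is a binary per-frame indicator and the pseudo-label activation is real-valued; the caption does not say which variable is treated as dichotomous, and the toy sequences below are made up.

```python
import math

def point_biserial(binary, values):
    """Point-biserial correlation between a 0/1 sequence and a numeric sequence.

    r = (M1 - M0) / s * sqrt(n1 * n0 / n^2), where s is the population
    standard deviation of `values`. This equals the Pearson correlation
    when one of the two variables is dichotomous.
    """
    n = len(binary)
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b == 0]
    n1, n0 = len(group1), len(group0)
    if n1 == 0 or n0 == 0:
        raise ValueError("both classes must be present")
    mean_all = sum(values) / n
    std_all = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    if std_all == 0.0:
        raise ValueError("values must not be constant")
    mean1 = sum(group1) / n1
    mean0 = sum(group0) / n0
    return (mean1 - mean0) / std_all * math.sqrt(n1 * n0 / (n * n))

# Toy frame sequences: a real event label and one prototype's activation.
real_label = [0, 0, 1, 1, 1, 0]
pseudo_activation = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1]

r = point_biserial(real_label, pseudo_activation)
print(f"point-biserial r = {r:.3f}")
```

Computing this value for every pair of real class and prototype would fill a matrix like the one the caption describes. A prototype that tracks a real event class closely yields a coefficient near 1.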