Cross-Referencing Self-Training Network for Sound Event Detection in Audio Mixtures

Sangwook Park, David K. Han, Mounya Elhilali

TL;DR

This work tackles the challenge of training sound event detectors with abundant unlabeled data by introducing Cross-Referencing Self-Training (CRST), a dual-model semi-supervised framework that mitigates self-bias by cross-labeling between a pair of networks trained on original and perturbed data. It couples CRST with a classwise post-processing pipeline that uses Extreme Value Theory-based thresholds and median filtering to extract accurate time intervals for each sound class. Across DESED/DCASE2020 benchmarks, CRST consistently outperforms strong baselines (MT, ICT, SRST) with significant gains in class-averaged F-scores, and the classwise post-processing contributes an additional 2–3% improvement. The approach demonstrates practical effectiveness for leveraging unlabeled data in SED and offers a scalable post-processing strategy to enhance interval-level predictions in real-world audio scenes.
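As a rough illustration of the classwise post-processing step mentioned above, the sketch below binarizes frame-level logits with one threshold per class, smooths the result with a median filter, and reads off onset/offset times. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the window length, the hop size, and the threshold values are all hypothetical, and the thresholds are simply passed in rather than derived from Extreme Value Theory as in the paper.

```python
import numpy as np


def median_filter_1d(x, window):
    """Sliding median over a 1-D array (odd window, edge-padded)."""
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    padded = np.pad(x, half, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, window)
    return np.median(windows, axis=-1)


def extract_intervals(logits, thresholds, window=5, hop_seconds=0.02):
    """Turn frame-level logits into per-class (onset, offset) intervals.

    logits      : array of shape (frames, classes), hypothetical model output
    thresholds  : one threshold per class (assumed given; the paper derives
                  them with Extreme Value Theory, which is not done here)
    window      : median filter length in frames (assumed value)
    hop_seconds : duration of one frame in seconds (assumed value)
    """
    intervals = {}
    for c in range(logits.shape[1]):
        # Classwise binarization: each class uses its own threshold.
        active = (logits[:, c] > thresholds[c]).astype(float)
        # Median filtering removes isolated spikes and fills short gaps.
        smoothed = median_filter_1d(active, window) > 0.5
        # Pad with zeros so every active run has both an onset and an offset.
        edges = np.diff(np.concatenate(([0], smoothed.astype(int), [0])))
        onsets = np.flatnonzero(edges == 1)
        offsets = np.flatnonzero(edges == -1)
        intervals[c] = [
            (on * hop_seconds, off * hop_seconds)
            for on, off in zip(onsets, offsets)
        ]
    return intervals
```

Passing the same value for every class (for example 0.0 in the logit domain) reduces this to global post-processing, which is the comparison the TL;DR refers to.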

Abstract

Sound event detection is an important facet of audio tagging that aims to identify sounds of interest and define both the sound category and time boundaries for each sound event in a continuous recording. With advances in deep neural networks, the performance of sound event detection systems has improved tremendously, although at the expense of costly data collection and labeling efforts. In fact, current state-of-the-art methods rely on supervised training that leverages large amounts of data samples and corresponding labels to identify the sound category and time stamps of events. As an alternative, the current study proposes a semi-supervised method for generating pseudo-labels from unlabeled data using a student-teacher scheme that balances self-training and cross-training. Additionally, this paper explores post-processing that extracts sound intervals from the network prediction to further improve sound event detection performance. The proposed approach is evaluated on the sound event detection task of the DCASE2020 challenge. The results on both the "validation" and "public evaluation" sets of the DESED database show significant improvement over state-of-the-art systems in semi-supervised learning.

Paper Structure

This paper contains 29 sections, 16 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: An SED model projects inputs in feature space onto label space, where $x$ and $x'$ are input features, $y$ is a true label, and $\hat{y}$ is the prediction. (a) Strategy of the consistency-regularization method, (b) Limitation of consistency regularization.
  • Figure 2: Diagram of the self-training framework, where $y$, $\hat{y}$, and $\tilde{y}$ are the true label, the student network's prediction, and the pseudo-label, respectively. $x'$ denotes data obtained from the original $x$ by a transformation function $T(\cdot)$. (a) Self-Referencing Self-Training (SRST) model, (b) Cross-Referencing Self-Training (CRST) model.
  • Figure 3: Histograms of the posterior distributions for two targets, Dishes and Speech. Once a model had converged in training, posteriors were calculated on weakly labeled data. Red dotted lines show the optimal threshold ($-t_{\alpha}$) for each class, while the black dotted line (at 0.0 in the logit domain) represents a global threshold for all targets.
  • Figure 4: Classwise performance on the "validation" set. (a) Classwise F-scores of two post-processing methods in four semi-supervised models. Results of classwise post-processing are shown as a red solid line, and results of global post-processing as a blue dotted line. Error bars show the standard deviation over five repetitions. (b) The leftmost table shows the number of sound intervals in the "validation" set for each class. The four matrices show classification confusion for detected intervals whose time boundaries match the ground truth. The numbers are means over five repetitions, and the standard deviation is indicated by the grayscale background shading.
  • Figure 5: Classwise performance on the "public evaluation" set. (a) Classwise F-scores of two post-processing methods in four semi-supervised models. Results of classwise post-processing are shown as a red solid line, and results of global post-processing as a blue dotted line. Error bars show the standard deviation over five repetitions. (b) The leftmost table shows the number of sound intervals in the "validation" set for each class. The four matrices show classification confusion for detected intervals whose time boundaries match the ground truth. The numbers are means over five repetitions, and the standard deviation is indicated by the grayscale background shading.
  • ...and 1 more figure