RCT: Random Consistency Training for Semi-supervised Sound Event Detection

Nian Shao, Erfan Loweimi, Xiaofei Li

TL;DR

This work tackles data scarcity in sound event detection by applying semi-supervised learning. It introduces Random Consistency Training (RCT), a framework that combines hard mixup, RandomWarping data augmentation, and a self-consistency loss within a MeanTeacher teacher-student setup. A novel self-consistency term and a label transformation for hard mixup are proposed to stabilize training and improve learning from unlabeled data. Experiments on the DCASE 2021 Task 4 dataset show that RCT outperforms several SSL baselines and remains competitive with top submissions.

Abstract

Sound event detection (SED), a core module of acoustic environmental analysis, suffers from data deficiency. Integrating semi-supervised learning (SSL) largely mitigates this problem at no extra annotation cost. This paper investigates several core modules of SSL and introduces a random consistency training (RCT) strategy. First, a self-consistency loss is proposed and fused with the teacher-student model to stabilize training. Second, a hard mixup data augmentation is proposed to account for the additive property of sounds. Third, a random augmentation scheme is applied to flexibly combine different types of data augmentation. Experiments show that the proposed strategy outperforms other widely used strategies.
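The hard mixup idea in the abstract can be illustrated with a minimal sketch. The abstract only states that hard mixup accounts for the additive property of sounds; everything else here is an assumption made for illustration: the function name `hard_mixup`, the choice to add features element-wise (which presumes a linear, non-logarithmic feature domain), and the choice to merge frame-level labels with an element-wise maximum so that an event counts as active in the mixture if it is active in either clip. The paper's actual label transformation may differ.

```python
from typing import List, Tuple

# Rows are time frames; columns are feature bins (x) or event classes (y).
Matrix = List[List[float]]


def hard_mixup(
    x_a: Matrix, x_b: Matrix, y_a: Matrix, y_b: Matrix
) -> Tuple[Matrix, Matrix]:
    """Mix two clips additively and merge their frame-level labels.

    Assumptions (not taken from the paper): features live in a linear
    domain where overlapping sounds add, and labels are merged with an
    element-wise maximum.
    """
    if len(x_a) != len(x_b) or len(y_a) != len(y_b):
        raise ValueError("both clips must have the same number of frames")

    # Additive mixing: overlapping sounds are summed rather than interpolated.
    x_mix = [[a + b for a, b in zip(row_a, row_b)]
             for row_a, row_b in zip(x_a, x_b)]

    # Label merge: an event is present in the mix if present in either clip.
    y_mix = [[max(a, b) for a, b in zip(row_a, row_b)]
             for row_a, row_b in zip(y_a, y_b)]
    return x_mix, y_mix
```

Unlike standard mixup, which interpolates both inputs and labels with a mixing weight, this sketch keeps the labels "hard": a mixed clip that contains two events is labelled with both of them at full strength.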

Paper Structure

This paper contains 9 sections, 3 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Flowchart of RCT: both hard mixup and audio warping are first applied for data augmentation; MeanTeacher (Tarvainen and Valpola, 2017) and self-consistency are used for SSL training. Subscripts $R$ and $M$ stand for RandomWarping and hard mixup, respectively.
  • Figure 2: The relative performance gain as a function of the maximum transformation magnitude ($d_\text{max}$). The transformation magnitude is sampled as $d \sim U[1,d_\text{max}]$. Relative performance gain is computed with respect to the baseline performance ($\text{PSDS}_1=34.74\%$, $\text{PSDS}_2=53.66\%$). The markers and vertical lines represent the mean and standard deviation over three trials.
  • Figure 3: Cross-entropy loss on strongly-supervised validation data when training with or without the self-consistency loss, compared with the ICT scheme (Verma et al., 2019).
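The quantities in the Figure 2 caption can be sketched as follows. The caption gives the baseline scores and the sampling range $U[1,d_\text{max}]$, but not the exact formulas, so this sketch makes three assumptions: relative gain is the usual percentage change with respect to the baseline, $d$ is drawn as a continuous value, and the standard deviation is the sample standard deviation. The function names are hypothetical.

```python
import random
import statistics
from typing import Sequence, Tuple

# Baseline scores quoted in the Figure 2 caption (in percent).
BASELINE_PSDS1 = 34.74
BASELINE_PSDS2 = 53.66


def sample_magnitude(d_max: float, rng: random.Random) -> float:
    """Draw d ~ U[1, d_max]; a continuous draw is an assumption."""
    return rng.uniform(1.0, d_max)


def relative_gain(score: float, baseline: float) -> float:
    """Percentage change of `score` with respect to `baseline` (assumed form)."""
    return (score - baseline) / baseline * 100.0


def summarize_trials(scores: Sequence[float], baseline: float) -> Tuple[float, float]:
    """Mean and sample standard deviation of the relative gain across trials."""
    gains = [relative_gain(s, baseline) for s in scores]
    return statistics.mean(gains), statistics.stdev(gains)
```

With three PSDS scores from repeated runs at a given $d_\text{max}$, `summarize_trials(scores, BASELINE_PSDS1)` would give one marker and one vertical line of the kind plotted in the figure.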