Formula-Supervised Sound Event Detection: Pre-Training Without Real Data

Yuto Shibata, Keitaro Tanaka, Yoshiaki Bando, Keisuke Imoto, Hirokatsu Kataoka, Yoshimitsu Aoki

TL;DR

This work tackles the challenge of obtaining large-scale, accurately labeled training data for sound event detection (SED). It proposes Formula-SED, a formula-driven supervised pre-training framework that synthesizes acoustic signals entirely from mathematical formulas and uses the synthesis parameters as ground-truth labels, eliminating label noise and privacy concerns. Gaussian process-based parameter generation enforces temporal coherence and cross-component correlations among harmonic and inharmonic parts, enabling strong supervision without real data. Pre-training with Formula-SED improves downstream SED performance and accelerates convergence on DESED/DCASE2023 Task 4 benchmarks, sometimes surpassing models trained with real data. This study demonstrates transferable auditory representations learned from synthetic formulas, offering a scalable, privacy-friendly path for environmental sound analysis.

Abstract

In this paper, we propose a novel formula-driven supervised learning (FDSL) framework for pre-training an environmental sound analysis model by leveraging acoustic signals parametrically synthesized through formula-driven methods. Specifically, we outline detailed procedures and evaluate their effectiveness for sound event detection (SED). The SED task, which involves estimating the types and timings of sound events, is particularly challenged by the difficulty of acquiring a sufficient quantity of accurately labeled training data. Moreover, it is well known that manually annotated labels often contain noise and are significantly influenced by the subjective judgment of annotators. To address these challenges, we propose a novel pre-training method that utilizes a synthetic dataset, Formula-SED, where acoustic data are generated solely based on mathematical formulas. The proposed method enables large-scale pre-training by using the synthesis parameters applied at each time step as ground truth labels, thereby eliminating label noise and bias. We demonstrate that large-scale pre-training with Formula-SED significantly enhances model accuracy and accelerates training, as evidenced by our results on the DESED dataset used for DCASE2023 Challenge Task 4. The project page is at https://yutoshibata07.github.io/Formula-SED/
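To make the idea of "synthesis parameters as ground-truth labels" concrete, the following is a minimal sketch and not the authors' implementation. It assumes an RBF-kernel Gaussian process for the parameter trajectories, a simple additive harmonic synthesizer, and white noise as a stand-in for the inharmonic part; the function names, kernel, length scales, and parameter ranges are all hypothetical choices made for illustration. It also samples each parameter independently, so it does not model the cross-component correlations that the paper describes.

```python
import numpy as np


def sample_gp_trajectory(n_frames, length_scale, rng):
    """Draw one smooth trajectory from a zero-mean GP with an RBF kernel.

    The RBF kernel is an assumption; this section does not state the kernel.
    """
    t = np.arange(n_frames, dtype=float)
    diff = t[:, None] - t[None, :]
    cov = np.exp(-0.5 * (diff / length_scale) ** 2)
    cov += 1e-6 * np.eye(n_frames)  # jitter keeps the Cholesky factor stable
    chol = np.linalg.cholesky(cov)
    return chol @ rng.standard_normal(n_frames)


def synthesize_clip(sr=16000, duration=1.0, hop=160, n_harmonics=4, seed=0):
    """Return (audio, labels): a synthetic clip and its frame-level parameters."""
    rng = np.random.default_rng(seed)
    n_samples = int(sr * duration)
    n_frames = n_samples // hop

    # Frame-level synthesis parameters. Because they come from a GP, they
    # vary smoothly over time. These same arrays later serve as the labels.
    f0 = 220.0 * 2.0 ** (0.3 * sample_gp_trajectory(n_frames, 20.0, rng))
    amp = 1.0 / (1.0 + np.exp(-2.0 * sample_gp_trajectory(n_frames, 20.0, rng)))

    # Upsample the frame-level parameters to the audio sample rate.
    frame_times = (np.arange(n_frames) + 0.5) * hop / sr
    sample_times = np.arange(n_samples) / sr
    f0_s = np.interp(sample_times, frame_times, f0)
    amp_s = np.interp(sample_times, frame_times, amp)

    # Harmonic part: integrate the instantaneous frequency to get the phase.
    phase = 2.0 * np.pi * np.cumsum(f0_s) / sr
    harmonic = sum(np.sin(k * phase) / k for k in range(1, n_harmonics + 1))

    # Inharmonic part: white noise as a placeholder (assumption).
    noise = 0.05 * rng.standard_normal(n_samples)

    audio = amp_s * harmonic + noise
    labels = np.stack([f0, amp], axis=1)  # shape: (n_frames, 2)
    return audio, labels


audio, labels = synthesize_clip()
```

Since the labels are the exact values used to generate the waveform, they contain no annotation noise by construction, which is the property the abstract highlights.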

Paper Structure

This paper contains 12 sections, 3 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: The overview of our proposed method. We effectively pre-train SED models using acoustic data generated solely based on mathematical formulas.
  • Figure 2: Comparison between real data (AudioSet) and our Formula-SED.
  • Figure 3: The training curve of the CRNN baseline [DCASE2023Workshop].
  • Figure 4: The impact of pre-training labels on fine-tuning performance.