Misophonia Trigger Sound Detection on Synthetic Soundscapes Using a Hybrid Model with a Frozen Pre-Trained CNN and a Time-Series Module

Kurumi Sashida, Gouhei Tanaka

TL;DR

This paper tackles misophonia trigger sound detection by generating a scalable, strongly labeled synthetic dataset to compensate for the scarcity of real-world data, and by evaluating hybrid models that pair a frozen pre-trained CNN front-end with a trainable time-series module. The authors compare Linear, GRU, LSTM, and ESN temporal back-ends, including bidirectional variants, on both a multi-class detection task and a few-shot personalization scenario. BiGRU delivers the highest detection accuracy, but a Bidirectional ESN (BiESN) achieves competitive performance with orders of magnitude fewer trainable parameters, making it attractive for on-device personalization. The work highlights the trade-off between accuracy and model footprint as a step toward practical, real-time misophonia mitigation, and points to domain adaptation as a path toward real-world robustness.

Abstract

Misophonia is a disorder characterized by a decreased tolerance to specific everyday sounds (trigger sounds) that can evoke intense negative emotional responses such as anger, panic, or anxiety. These reactions can substantially impair daily functioning and quality of life. Assistive technologies that selectively detect trigger sounds could help reduce distress and improve well-being. In this study, we investigate sound event detection (SED) to localize intervals of trigger sounds in continuous environmental audio as a foundational step toward such assistive support. Motivated by the scarcity of real-world misophonia data, we generate synthetic soundscapes tailored to misophonia trigger sound detection using audio synthesis techniques. Then, we perform trigger sound detection tasks using hybrid CNN-based models. The models combine feature extraction using a frozen pre-trained CNN backbone with a trainable time-series module such as gated recurrent units (GRUs), long short-term memories (LSTMs), echo state networks (ESNs), and their bidirectional variants. The detection performance is evaluated using common SED metrics, including Polyphonic Sound Detection Score 1 (PSDS1). On the multi-class trigger SED task, bidirectional temporal modeling consistently improves detection performance, with Bidirectional GRU (BiGRU) achieving the best overall accuracy. Notably, the Bidirectional ESN (BiESN) attains competitive performance while requiring orders of magnitude fewer trainable parameters by optimizing only the readout. We further simulate user personalization via a few-shot "eating sound" detection task with at most five support clips, in which BiGRU and BiESN are compared. In this strict adaptation setting, BiESN shows robust and stable performance, suggesting that lightweight temporal modules are promising for personalized misophonia trigger SED.
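
As a rough illustration of the hybrid design described in the abstract, the following sketch runs a fixed random reservoir over a sequence of frame embeddings in both directions and applies a time-shared linear + sigmoid readout. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the dimensions, the uniform weight initialization, the leak rate, and the choice to share one reservoir between the two directions are all hypothetical, and the readout weights are random rather than fitted. Plain Python is used only to keep the example dependency-free. The two counting helpers indicate why training only the readout keeps the number of trainable parameters small; the GRU count assumes one layer per direction and the PyTorch convention of two bias vectors per gate.

```python
import math
import random


def make_matrix(rows, cols, scale, rng):
    """Random matrix with entries drawn uniformly from [-scale, scale]."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def run_reservoir(frames, w_in, w_res, leak=1.0):
    """Run a fixed, untrained reservoir over frames; return one state per frame."""
    state = [0.0] * len(w_res)
    states = []
    for z in frames:
        pre = [
            sum(a * b for a, b in zip(w_in[i], z))
            + sum(a * b for a, b in zip(w_res[i], state))
            for i in range(len(w_res))
        ]
        # Leaky-integrator update; leak=1.0 reduces to state = tanh(pre).
        state = [(1.0 - leak) * s + leak * math.tanh(p) for s, p in zip(state, pre)]
        states.append(state)
    return states


def bidirectional_states(frames, w_in, w_res):
    """Concatenate forward states with states from a pass over the reversed clip.

    Sharing one reservoir between both directions is an assumption of this sketch.
    """
    forward = run_reservoir(frames, w_in, w_res)
    backward = run_reservoir(frames[::-1], w_in, w_res)[::-1]
    return [f + b for f, b in zip(forward, backward)]


def readout(states, w_out, bias):
    """Time-shared linear + sigmoid readout giving per-frame class posteriors."""
    return [
        [
            1.0 / (1.0 + math.exp(-(sum(w * s for w, s in zip(row, x)) + b)))
            for row, b in zip(w_out, bias)
        ]
        for x in states
    ]


def biesn_trainable_params(n_reservoir, n_classes):
    """Only the readout is trained: weights over 2 * n_reservoir states plus biases."""
    return n_classes * 2 * n_reservoir + n_classes


def bigru_trainable_params(embed_dim, hidden, n_classes):
    """One GRU layer per direction (3 gates, two bias vectors each) plus the readout."""
    per_direction = 3 * (embed_dim * hidden + hidden * hidden + 2 * hidden)
    return 2 * per_direction + n_classes * 2 * hidden + n_classes


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical sizes chosen for a quick demo, not taken from the paper.
    embed_dim, n_reservoir, n_classes, n_frames = 8, 32, 3, 20
    frames = [[rng.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(n_frames)]
    w_in = make_matrix(n_reservoir, embed_dim, 0.5, rng)
    w_res = make_matrix(n_reservoir, n_reservoir, 0.05, rng)
    w_out = make_matrix(n_classes, 2 * n_reservoir, 0.1, rng)
    states = bidirectional_states(frames, w_in, w_res)
    posteriors = readout(states, w_out, [0.0] * n_classes)
    print(len(posteriors), "frames x", len(posteriors[0]), "classes")
    print("BiESN trainable:", biesn_trainable_params(n_reservoir, n_classes))
    print("BiGRU trainable:", bigru_trainable_params(embed_dim, n_reservoir, n_classes))
```

In a practical ESN the recurrent matrix is usually rescaled to a target spectral radius and the readout is fitted (for example by ridge regression) on frame-level labels; both steps are omitted here to keep the sketch short.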

Paper Structure (20 sections, 1 equation, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Overview of audio classification and sound event detection (SED). Left: audio classification assigns clip-level labels without temporal localization. Right: SED aims to identify event categories and estimate their onset and offset times within an audio clip.
  • Figure 2: Frozen CNN + temporal module pipeline for frame-wise multi-label sound event detection. A pre-trained frame-wise MobileNetV3 backbone extracts embeddings $\mathbf{z}_t$, which are processed by a temporal module (GRU/LSTM/ESN; uni- or bidirectional variants) and a time-shared linear + sigmoid readout to yield per-frame class posteriors.
  • Figure 3: Few-shot user-adaptive trigger SED performance measured by PSDS1 as a function of the number of support clips $K$. In each run, $K$ support clips are sampled from a fixed pool of eight recordings using a different random seed. Points indicate the mean over 10 runs, and error bars denote $\pm 1$ standard deviation.
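
The sampling protocol in the Figure 3 caption (for each $K$, draw $K$ support clips from a fixed pool of eight recordings with a different random seed per run, then report the mean and $\pm 1$ standard deviation over 10 runs) can be sketched as follows. This is a minimal sketch, not the authors' evaluation code: `few_shot_curve`, `evaluate`, and `base_seed` are hypothetical names, `evaluate` stands in for adapting a model on the support clips and returning its PSDS1, and the use of the sample (rather than population) standard deviation is an assumption, since the caption does not specify it.

```python
import random
import statistics


def few_shot_curve(pool, ks, n_runs, evaluate, base_seed=0):
    """For each K, sample K support clips per run and summarize the scores.

    Returns a dict mapping K to (mean, standard deviation) over n_runs runs.
    """
    curve = {}
    for k in ks:
        scores = []
        for run in range(n_runs):
            rng = random.Random(base_seed + run)  # a different seed for each run
            support = rng.sample(pool, k)  # K distinct clips from the fixed pool
            scores.append(evaluate(support))
        curve[k] = (statistics.mean(scores), statistics.stdev(scores))
    return curve


if __name__ == "__main__":
    pool = [f"clip_{i}" for i in range(8)]  # placeholder names for the 8 recordings

    def toy_score(support):
        # Stand-in metric for demonstration only; a real run would adapt the
        # model on `support` and compute PSDS1 on held-out audio.
        return sum(int(name.split("_")[1]) for name in support) / (8 * len(support))

    for k, (mean, std) in few_shot_curve(pool, range(1, 6), 10, toy_score).items():
        print(f"K={k}: mean={mean:.3f}, std={std:.3f}")
```

Seeding each run separately keeps the support sets reproducible, so different temporal modules can be compared on identical draws.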