From Weak to Strong Sound Event Labels using Adaptive Change-Point Detection and Active Learning

John Martinsson, Olof Mogren, Maria Sandsten, Tuomas Virtanen

TL;DR

The paper addresses how to obtain temporally precise strong labels for sound event detection (SED) under a fixed annotation budget. It introduces adaptive change point detection (A-CPD), a machine-guided querying strategy that applies change point detection (CPD) to a prediction model’s probability curve to select informative query segments, within an active learning loop built on ProtoNet/BirdNET embeddings. In experiments on the Meerkat, Dog, and Baby cry datasets, A-CPD produces higher-quality strong labels (measured by $F_{1e}$ and $F_{1s}$) and better downstream test performance than the fixed-length and non-adaptive strategies, although a gap to the oracle strategy remains. The work demonstrates a practical way to convert weak labels into high-quality strong labels, enabling more accurate SED and event counting with limited labeling resources, and offers a framework that could be extended to other domains.

Abstract

We propose an adaptive change point detection method (A-CPD) for machine guided weak label annotation of audio recording segments. The goal is to maximize the amount of information gained about the temporal activations of the target sounds. For each unlabeled audio recording, we use a prediction model to derive a probability curve used to guide annotation. The prediction model is initially pre-trained on available annotated sound event data with classes that are disjoint from the classes in the unlabeled dataset. The prediction model then gradually adapts to the annotations provided by the annotator in an active learning loop. We derive query segments to guide the weak label annotator towards strong labels, using change point detection on these probabilities. We show that it is possible to derive strong labels of high quality with a limited annotation budget, and show favorable results for A-CPD when compared to two baseline query segment strategies.
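
The query construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the change-point score (absolute difference between the mean probability in the windows after and before each frame), the ranking of peaks by height rather than prominence, and the names `change_point_scores`, `top_peaks`, `build_queries`, `half_window`, and `frame_sec` are all assumptions made for illustration.

```python
from typing import List, Tuple


def change_point_scores(probs: List[float], half_window: int = 2) -> List[float]:
    """Assumed score: |mean(prob after t) - mean(prob before t)| per frame."""
    n = len(probs)
    scores = [0.0] * n
    for t in range(half_window, n - half_window + 1):
        before = sum(probs[t - half_window:t]) / half_window
        after = sum(probs[t:t + half_window]) / half_window
        scores[t] = abs(after - before)
    return scores


def top_peaks(scores: List[float], k: int) -> List[int]:
    """Return the k highest local maxima, in temporal order.

    Simplification: peaks are ranked by height, not by prominence.
    """
    peaks = [
        t for t in range(1, len(scores) - 1)
        if scores[t] > scores[t - 1] and scores[t] >= scores[t + 1]
    ]
    peaks.sort(key=lambda t: scores[t], reverse=True)
    return sorted(peaks[:k])


def build_queries(
    probs: List[float],
    budget: int,
    frame_sec: float = 1.0,
    half_window: int = 2,
) -> List[Tuple[float, float]]:
    """Split a recording into at most `budget` queries (start, end) in seconds.

    The budget - 1 strongest change points become the query boundaries.
    Fewer queries are returned if fewer peaks are found.
    """
    scores = change_point_scores(probs, half_window)
    cuts = top_peaks(scores, budget - 1)
    bounds = [0] + cuts + [len(probs)]
    return [(s * frame_sec, e * frame_sec) for s, e in zip(bounds[:-1], bounds[1:])]


# Hypothetical probability curve with one event spanning frames 4..7.
probs = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
queries = build_queries(probs, budget=3)
```

In this toy case the two boundaries land on the onset and offset of the single event, so a weak label for the middle query would already recover its timing. In the adaptive setting described above, the probability curve would be recomputed as the prediction model is updated with new annotations.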

Paper Structure (18 sections, 6 equations, 4 figures, 3 tables)


Figures (4)

  • Figure 1: Illustration of the segmentation of an audio spectrogram with three target events, shown in shaded green (top panel), into a set of audio query segments $q_0, \dots, q_6$ using a method that is optimal with respect to the derived strong label timings (middle panel) and a sub-optimal method (bottom panel). The resulting annotations, derived from the weak labels given by the annotator, are shown in shaded red for both methods. Query $q_4$ for the optimal method is omitted for clarity.
  • Figure 2: Qualitative example of how the query strategies A-CPD, F-CPD and FIX segment a spectrogram of an audio recording with three target events, shown in shaded green (top panel), into $B=7$ queries. A-CPD (second panel) uses change point detection (blue line) on the probability curve from a prediction model (orange line) to detect the $B-1$ most prominent peaks (red crosses), which are used to construct a set of queries $\{q_0, \dots, q_{B-1}\}$ (dashed red lines). Each query $q_i = (s_i, e_i)$ is given a weak label $c_i \in \{0, 1\}$ ($c_i=1$ shown as shaded red), resulting in the $i$-th annotation $(s_i, e_i, c_i)$. F-CPD (third panel) applies change point detection directly to the cosine distances in embedding space (blue line) and then constructs queries in the same way as A-CPD. FIX (fourth panel) uses fixed-length queries.
  • Figure 3: The average $F_{1s}$-score over the three classes for each of the studied annotation processes, plotted against the number of queries per audio recording, $B$. The results are shown for an annotator without noise (left) and with $\beta=0.2$ (right). Note that ORC is $1.0$ when $\beta=0$ and is therefore not shown in the left panel. The shaded region marks where $B \geq B_{\text{suff}}$.
  • Figure 4: The average test-time $F_{1s}$-score over the studied sound classes for a ProtoNet (top) and an MLP (bottom) trained with the annotations from each respective annotation process and setting. The shaded region marks where $B \geq B_{\text{suff}}$.
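
The captions of Figures 1 and 2 describe how weakly labeled queries $(s_i, e_i, c_i)$ turn into strong labels. The sketch below illustrates that step together with the FIX baseline. It is a hedged illustration, not the paper's code: the function names are hypothetical, and the simulated annotator is assumed to answer $c_i = 1$ whenever a query overlaps a target event at all, without noise.

```python
from typing import List, Tuple

Segment = Tuple[float, float]
Annotation = Tuple[float, float, int]


def fixed_queries(duration: float, budget: int) -> List[Segment]:
    """FIX-style baseline: split the recording into `budget` equal-length queries."""
    step = duration / budget
    return [(i * step, (i + 1) * step) for i in range(budget)]


def weak_label(query: Segment, events: List[Segment]) -> int:
    """Assumed noise-free annotator: 1 if the query overlaps any target event."""
    s, e = query
    return int(any(s < offset and onset < e for onset, offset in events))


def strong_labels(annotations: List[Annotation]) -> List[Segment]:
    """Merge adjacent positive queries into (onset, offset) strong labels."""
    merged: List[Segment] = []
    for s, e, c in sorted(annotations):
        if c != 1:
            continue
        if merged and abs(merged[-1][1] - s) < 1e-9:
            # This query starts where the previous positive one ended.
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged


def annotate(queries: List[Segment], events: List[Segment]) -> List[Segment]:
    """Label every query weakly, then derive strong labels from the answers."""
    annotations = [(s, e, weak_label((s, e), events)) for s, e in queries]
    return strong_labels(annotations)


# Hypothetical 12 s recording with one target event from 2 s to 4 s.
true_events = [(2.0, 4.0)]
coarse = annotate(fixed_queries(12.0, budget=3), true_events)
fine = annotate(fixed_queries(12.0, budget=6), true_events)
```

With three fixed queries the derived label covers 0 s to 4 s, overestimating the event, while six queries happen to align with the event boundaries and recover it exactly. This is the effect shown in Figure 1: the quality of the strong labels depends on how well the query boundaries match the event boundaries, which is what a change-point-based strategy tries to achieve with fewer queries.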