Onset and offset weighted loss function for sound event detection

Tao Song

TL;DR

The paper addresses onset/offset timing errors in sound event detection (SED) that are not fully mitigated by median filtering. It introduces onset and offset weighted binary cross-entropy (OWBCE), a frame-level loss that uses per-frame weights derived from onset/offset cues via first-order differences and a sinusoidal window, with hyperparameters $\alpha$ and $\sigma$ shaping the emphasis. OWBCE is evaluated on the DESED dataset across multiple CRNN-based models, showing clear improvements in event-F1 and PSDS metrics, and robustness to timestamp annotation noise. The results demonstrate practical benefits for more accurate timestamp estimation in SED and suggest broad applicability across architectures and data conditions.
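
The weighting idea summarized above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `onset_offset_weights` and `owbce`, the baseline weight of 1, and the parameters `height` and `width` (chosen to match the window height and width mentioned in the Figure 4 caption) are all hypothetical, and how they map onto $\alpha$ and $\sigma$ is not specified here.

```python
import numpy as np

def onset_offset_weights(labels, height=12.0, width=7):
    """Per-frame weights for one class; labels is a (T,) binary array, T >= width.

    Assumption: weight = 1 everywhere, plus a sine-shaped bump around each
    onset/offset. The exact form used in the paper may differ.
    """
    labels = np.asarray(labels, dtype=float)
    # First-order difference: non-zero only where the label changes,
    # i.e. at the first active frame (onset) and the first inactive frame (offset).
    boundary = np.abs(np.diff(labels, prepend=labels[0]))
    # Half-period sine window whose centre value equals `height` (odd width).
    window = height * np.sin(np.pi * (np.arange(width) + 0.5) / width)
    # Spread each boundary impulse over the neighbouring frames.
    bump = np.convolve(boundary, window, mode="same")
    return 1.0 + bump

def owbce(probs, labels, height=12.0, width=7, eps=1e-7):
    """Frame-level BCE averaged with the onset/offset weights above."""
    labels = np.asarray(labels, dtype=float)
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    bce = -(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs))
    return float(np.mean(onset_offset_weights(labels, height, width) * bce))

# Toy clip: 30 frames with one event on frames 10..19.
labels = np.array([0] * 10 + [1] * 10 + [0] * 10)
weights = onset_offset_weights(labels)
```

With this sketch, a misclassified frame at the onset contributes far more to the loss than the same mistake in the middle of the event, which is the behaviour the TL;DR describes.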

Abstract

In a typical sound event detection (SED) system, the existence of a sound event is detected at a frame level, and consecutive frames with the same event detected are combined as one sound event. The median filter is applied as a post-processing step to remove detection errors as much as possible. However, detection errors occurring around the onset and offset of a sound event are beyond the capacity of the median filter. To address this issue, an onset and offset weighted binary cross-entropy (OWBCE) loss function is proposed in this paper, which trains the DNN model to be more robust on frames around onsets and offsets. Experiments are carried out in the context of DCASE 2022 Task 4. Results show that OWBCE outperforms BCE across the different models considered. For a basic CRNN, OWBCE achieves relative improvements of 6.43% in event-F1, 1.96% in PSDS1, and 2.43% in PSDS2.
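
The post-processing pipeline described in the abstract (frame-level decisions, median filtering, then merging consecutive active frames into events) can be sketched as below. This is a hedged illustration only: the helper names `median_filter` and `frames_to_events`, the filter size of 5, and the hop duration `hop_s` are assumptions made for the example, not values taken from the paper.

```python
import numpy as np

def median_filter(x, size=5):
    """1-D median filter with edge padding; `size` is assumed odd."""
    pad = size // 2
    padded = np.pad(np.asarray(x, dtype=float), pad, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, size)
    return np.median(windows, axis=-1)

def frames_to_events(active, hop_s=1.0):
    """Merge runs of consecutive active frames into (onset, offset) pairs.

    `hop_s` is a hypothetical frame hop in seconds; offsets are exclusive.
    """
    active = (np.asarray(active) > 0).astype(int)
    # Pad with zeros so events touching the clip edges are still closed.
    change = np.diff(np.concatenate(([0], active, [0])))
    onsets = np.flatnonzero(change == 1)
    offsets = np.flatnonzero(change == -1)
    return [(float(on * hop_s), float(off * hop_s)) for on, off in zip(onsets, offsets)]

# True event on frames 10..19.
# Case 1: an isolated false alarm (frame 3) and a dropout inside the event (frame 15).
inside_errors = np.zeros(30)
inside_errors[10:20] = 1
inside_errors[3] = 1
inside_errors[15] = 0

# Case 2: the first two frames of the event are missed.
onset_errors = np.zeros(30)
onset_errors[12:20] = 1

print(frames_to_events(median_filter(inside_errors)))
print(frames_to_events(median_filter(onset_errors)))
```

In the first case the filter removes both errors and the event boundaries are recovered. In the second case the filtered output still starts late, which illustrates why errors around the onset are beyond what median filtering can repair.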

Paper Structure (10 sections, 4 equations, 4 figures, 4 tables)

Introduction
Related work
Proposed method
Experiment
Dataset
Experimental setting
Hyper-parameter optimization
Robustness to human errors in timestamp annotation
Effectiveness on other models
Conclusion

Figures (4)

Figure 1: Two examples of median filtering on detection results. Detection results are denoted with filled rectangles, and the real onset and offset are denoted with red dashed lines. In (a), detection errors that occur within or away from the sound event are removed by the median filter. In (b), detection errors that occur around the onset are also removed by the median filter, but new errors are introduced.
Figure 2: Calculation of OWBCE.
Figure 3: An example of the loss weighting function.
Figure 4: Performance gain of OWBCE over BCE as a function of the width and height of the sine window. Width is set to 7 in (a) and height is set to 12 in (b).
