LEAD Dataset: How Can Labels for Sound Event Detection Vary Depending on Annotators?

Naoki Koga, Yoshiaki Bando, Keisuke Imoto

TL;DR

The paper tackles how strong labels for sound event detection vary across annotators and datasets, which can bias SED models and evaluations. It introduces LEAD, a large-scale dataset with 20 annotators per clip across multiple datasets, including two confidence scores per event, to quantify class-level and temporal variations. Through analysis and pseudo-SED experiments, it demonstrates that temporal onset/offset variation can dramatically degrade event-based evaluation and that collar settings influence measured performance, highlighting the need for robust metrics like PSDS. The work provides a valuable resource for developing annotator-aware SED models and more robust evaluation protocols, with practical implications for dataset construction and model assessment.
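To make the collar effect concrete, here is a minimal sketch of event-based scoring in which one annotator's labels are scored against another's. It is an illustration under stated assumptions, not the paper's evaluation code: the function `event_f1`, the fixed collar applied to both onset and offset, the greedy matching, and all event times are hypothetical, and standard toolkits use more elaborate offset rules.

```python
from typing import List, Tuple

# (label, onset in seconds, offset in seconds)
Event = Tuple[str, float, float]


def event_f1(reference: List[Event], estimated: List[Event], collar: float = 0.2) -> float:
    """Simplified event-based F1.

    An estimated event counts as a hit when its label matches an unmatched
    reference event and both onset and offset lie within `collar` seconds.
    Assumption: a single fixed collar and greedy first-match pairing.
    """
    matched = set()
    true_positives = 0
    for label, onset, offset in estimated:
        for i, (ref_label, ref_onset, ref_offset) in enumerate(reference):
            if i in matched or label != ref_label:
                continue
            if abs(onset - ref_onset) <= collar and abs(offset - ref_offset) <= collar:
                matched.add(i)
                true_positives += 1
                break
    false_positives = len(estimated) - true_positives
    false_negatives = len(reference) - true_positives
    denominator = 2 * true_positives + false_positives + false_negatives
    return 2 * true_positives / denominator if denominator else 0.0


# Hypothetical labels from two annotators for the same clip.
annotator_a = [("bird_singing", 1.0, 3.0), ("car", 5.0, 8.0)]
annotator_b = [("bird_singing", 1.3, 3.1), ("car", 5.05, 8.1)]

f1_tight = event_f1(annotator_a, annotator_b, collar=0.2)
f1_loose = event_f1(annotator_a, annotator_b, collar=0.5)
print(f1_tight, f1_loose)
```

In this toy case both annotators agree on the event classes, yet a 0.3 s disagreement on one onset halves the score at the tighter collar, while the looser collar hides the disagreement entirely.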

Abstract

In this paper, we introduce the LargE-scale Annotator's labels for sound event Detection (LEAD) dataset, created to gain a better understanding of the variation in strong labels in sound event detection (SED). Collecting large-scale strong labels for SED is very time-consuming, so in most cases multiple workers divide up the annotation work to create a single dataset. In general, strong labels created by multiple annotators vary widely in both the type of sound event and the temporal onset/offset. Uniquely determining the strong label from the annotations of multiple workers is quite difficult because datasets contain sounds that can be mistaken for similar classes and sounds whose temporal onset/offset is difficult to distinguish. If strong labels vary greatly depending on the annotator, an SED model trained on a dataset created by multiple annotators will be biased. Moreover, if annotators differ between training and evaluation data, there is a risk that the model cannot be evaluated correctly. To investigate the variation in strong labels, we release the LEAD dataset, which provides distinct strong labels for each clip annotated by 20 different annotators. The LEAD dataset allows us to investigate how strong labels vary from annotator to annotator and to consider SED models that are robust to this variation. The LEAD dataset consists of strong labels assigned to sound clips from TUT Sound Events 2016/2017, TUT Acoustic Scenes 2016, and URBAN-SED. We also analyze the variations in the strong labels in the LEAD dataset and provide insights into them.

Paper Structure

This paper contains 12 sections, 8 figures, and 2 tables.

Figures (8)

  • Figure 1: Overview of LEAD dataset
  • Figure 2: TUT Acoustic Scenes 2016
  • Figure 3: TUT Sound Events 2016/2017
  • Figure 4: URBAN-SED
  • Figure 6: "bird_singing" in b006.wav
  • ...and 3 more figures