FSD50K: An Open Dataset of Human-Labeled Sound Events

Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, Xavier Serra

TL;DR

FSD50K addresses the shortage of open, large-vocabulary sound event datasets by introducing a freely distributable collection of 51,197 Freesound clips labeled with 200 AudioSet Ontology classes. The authors detail an end-to-end creation workflow combining automatic nomination, manual validation, expert refinement, and hierarchical label propagation to produce a robust ground truth, with a carefully balanced development/evaluation split that minimizes uploader contamination. Baseline experiments across CRNN, VGG-like, DenseNet, and ResNet architectures establish practical benchmarks and reveal that compact, audio-informed models can outperform larger off-the-shelf computer vision networks on this diverse, noisy dataset. The work also discusses label noise, data bias, and vocabulary granularity, offering guidance for researchers to use FSD50K effectively and to extend large-vocabulary open sound event recognition (SER) benchmarks. Overall, FSD50K advances reproducible, open-domain SER research and enables cross-dataset and domain-adaptation studies by aligning with AudioSet’s ontology while providing access to raw waveforms and richer ground truth than prior open sets.
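
To make the idea of hierarchical label propagation concrete, here is a minimal sketch: every label assigned to a clip also implies all of its ancestors in the class hierarchy. The `PARENTS` map and the `propagate` function are hypothetical names introduced for illustration; the hierarchy fragment is hand-written and is not read from the FSD50K or AudioSet files, and the authors' actual procedure may differ.

```python
from typing import Dict, List, Set

# Hypothetical fragment of a class hierarchy: child -> list of parents.
# Illustrative only; not loaded from the actual dataset or ontology files.
PARENTS: Dict[str, List[str]] = {
    "Electric_guitar": ["Guitar"],
    "Guitar": ["Plucked_string_instrument"],
    "Plucked_string_instrument": ["Musical_instrument"],
    "Musical_instrument": ["Music"],
    "Meow": ["Cat"],
    "Cat": ["Domestic_animals_and_pets"],
    "Domestic_animals_and_pets": ["Animal"],
}


def propagate(labels: Set[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Return the given labels plus all of their ancestors."""
    out = set(labels)
    stack = list(labels)
    while stack:
        node = stack.pop()
        # A class may have several parents, so the hierarchy is treated as a DAG.
        for parent in parents.get(node, []):
            if parent not in out:
                out.add(parent)
                stack.append(parent)
    return out


if __name__ == "__main__":
    print(sorted(propagate({"Meow"}, PARENTS)))
```

Under this sketch, a clip labeled only with the leaf class ends up carrying its whole ancestor chain, which is why label counts differ between the propagated and unpropagated forms of the ground truth.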

Abstract

Most existing datasets for sound event recognition (SER) are relatively small and/or domain-specific, with the exception of AudioSet, based on over 2M tracks from YouTube videos and encompassing over 500 sound classes. However, AudioSet is not an open dataset as its official release consists of pre-computed audio features. Downloading the original audio tracks can be problematic due to YouTube videos gradually disappearing and usage rights issues. To provide an alternative benchmark dataset and thus foster SER research, we introduce FSD50K, an open dataset containing over 51k audio clips totalling over 100h of audio manually labeled using 200 classes drawn from the AudioSet Ontology. The audio clips are licensed under Creative Commons licenses, making the dataset freely distributable (including waveforms). We provide a detailed description of the FSD50K creation process, tailored to the particularities of Freesound data, including challenges encountered and solutions adopted. We include a comprehensive dataset characterization along with discussion of limitations and key factors to allow its audio-informed usage. Finally, we conduct sound event classification experiments to provide baseline systems as well as insight on the main factors to consider when splitting Freesound audio data for SER. Our goal is to develop a dataset to be widely adopted by the community as a new open benchmark for SER research.

Paper Structure

This paper contains 56 sections, 4 equations, 8 figures, 7 tables, 2 algorithms.

Figures (8)

  • Figure 1: Overall process of the creation of FSD50K. The process starts from Freesound and the AudioSet Ontology. Stages in green involve automatic data mining, stages in orange correspond to manual annotation tasks, and stages in blue involve data processing to shape the dataset.
  • Figure 2: Screenshot of the "Training phase" page used for the validation task.
  • Figure 3: Screenshot of the "Validation phase" used for the validation task.
  • Figure 4: Table for exploring the ontology in the refinement task.
  • Figure 5: Label distributions in dev (left) and eval (right) sets. Clips in eval tend to have more labels (by dataset curation). The x-axis scale is logarithmic. Number of labels is reported in the unpropagated form. Note that visualization span differs among plots.
  • ...and 3 more figures