FSD50K: An Open Dataset of Human-Labeled Sound Events
Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, Xavier Serra
TL;DR
FSD50K addresses the shortage of open, large-vocabulary datasets for sound event recognition (SER) by introducing a freely distributable collection of 51,197 Freesound clips labeled with 200 classes from the AudioSet Ontology. The authors detail an end-to-end creation workflow that combines automatic nomination of candidate labels, manual validation, expert refinement, and hierarchical label propagation to produce the ground truth, together with a development/evaluation split designed to minimize contamination from the same uploaders appearing on both sides. Baseline experiments with CRNN, VGG-like, DenseNet, and ResNet architectures establish reference results and indicate that compact, audio-informed models can outperform larger off-the-shelf computer vision networks on this diverse, noisy dataset. The work also discusses label noise, data bias, and vocabulary granularity, offering guidance on using FSD50K effectively and on extending large-vocabulary open SER benchmarks. Overall, FSD50K supports reproducible, open-domain SER research and enables cross-dataset and domain-adaptation studies by aligning with the AudioSet Ontology while providing raw waveforms and richer ground truth than prior open datasets.
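To make the idea of hierarchical label propagation concrete, the following is a minimal sketch in which every label assigned to a clip also implies all of its ancestors in the class hierarchy. The `PARENTS` map is a small illustrative excerpt written for this example, and the function name `propagate_labels` is hypothetical; neither is taken from the authors' code or from the official ontology file.

```python
# Illustrative child -> parents map (an assumption for this sketch,
# not the official AudioSet Ontology file).
PARENTS = {
    "Electric guitar": ["Guitar"],
    "Guitar": ["Plucked string instrument"],
    "Plucked string instrument": ["Musical instrument"],
    "Musical instrument": ["Music"],
    "Music": [],
}


def propagate_labels(labels, parents):
    """Return the given labels plus all of their ancestors."""
    propagated = set()
    stack = list(labels)
    while stack:
        label = stack.pop()
        if label in propagated:
            continue  # already visited via another path
        propagated.add(label)
        # Walk upward; labels missing from the map are treated as roots.
        stack.extend(parents.get(label, []))
    return propagated


if __name__ == "__main__":
    print(sorted(propagate_labels({"Electric guitar"}, PARENTS)))
```

A clip labeled only with a leaf class thus also carries its broader parent classes, which keeps the ground truth consistent across levels of the hierarchy.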
Abstract
Most existing datasets for sound event recognition (SER) are relatively small and/or domain-specific, with the exception of AudioSet, based on over 2M tracks from YouTube videos and encompassing over 500 sound classes. However, AudioSet is not an open dataset as its official release consists of pre-computed audio features. Downloading the original audio tracks can be problematic due to YouTube videos gradually disappearing and usage rights issues. To provide an alternative benchmark dataset and thus foster SER research, we introduce FSD50K, an open dataset containing over 51k audio clips totalling over 100h of audio manually labeled using 200 classes drawn from the AudioSet Ontology. The audio clips are licensed under Creative Commons licenses, making the dataset freely distributable (including waveforms). We provide a detailed description of the FSD50K creation process, tailored to the particularities of Freesound data, including challenges encountered and solutions adopted. We include a comprehensive dataset characterization along with discussion of limitations and key factors to allow its audio-informed usage. Finally, we conduct sound event classification experiments to provide baseline systems as well as insight on the main factors to consider when splitting Freesound audio data for SER. Our goal is to develop a dataset to be widely adopted by the community as a new open benchmark for SER research.
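One factor when splitting Freesound data is that clips from the same uploader tend to be similar, so letting one uploader appear in both splits can inflate evaluation scores. The sketch below shows one simple way to keep uploaders disjoint across splits. It is an assumption-laden illustration, not the authors' procedure: the function `split_by_uploader`, the `(clip_id, uploader)` input format, and the greedy smallest-uploader-first assignment are all hypothetical, and the sketch ignores per-class balance.

```python
from collections import defaultdict


def split_by_uploader(clips, eval_fraction=0.2):
    """Split (clip_id, uploader) pairs so that no uploader is in both sets.

    Returns (dev_ids, eval_ids). Greedy heuristic: uploaders with the
    fewest clips are assigned to the evaluation set until adding the next
    uploader would exceed the target size; the rest go to development.
    """
    by_uploader = defaultdict(list)
    for clip_id, uploader in clips:
        by_uploader[uploader].append(clip_id)

    target = eval_fraction * len(clips)
    dev_ids, eval_ids = [], []
    # Sort by (clip count, name) so the result is deterministic.
    for uploader in sorted(by_uploader, key=lambda u: (len(by_uploader[u]), u)):
        ids = by_uploader[uploader]
        if len(eval_ids) + len(ids) <= target:
            eval_ids.extend(ids)
        else:
            dev_ids.extend(ids)
    return dev_ids, eval_ids


if __name__ == "__main__":
    toy = [("c1", "a"), ("c2", "a"), ("c3", "b"), ("c4", "c"), ("c5", "c")]
    dev, evl = split_by_uploader(toy, eval_fraction=0.4)
    print("dev:", dev, "eval:", evl)
```

A real split would also need to track the class distribution in each set, which this sketch leaves out.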
