Merlin L48 Spectrogram Dataset
Aaron Sun, Subhransu Maji, Grant Van Horn
TL;DR
The Merlin L48 spectrogram dataset targets the realistic single-positive multi-label (SPML) learning setting in a fine-grained acoustic domain, in contrast to traditional SPML benchmarks that simulate label scarcity by synthetically discarding labels. The authors benchmark multiple SPML losses (e.g., BCE-AN, LS, EM, ROLE, LL variants) on L48, introduce an asset-level consistency regularizer ${\cal R}_P$, and incorporate negative label priors via ${\cal L}_{SPML}^-$. They show that real-world SPML on L48 is harder than synthetic COCO-based benchmarks: LS and EM are the most robust baselines, while LL variants falter due to fine-grained misclassifications. Asset regularization consistently improves performance across methods, and geo/checklist priors provide additional gains, though not enough to close the gap to full supervision. The dataset, paired with a comprehensive benchmark and open-source tooling, highlights the need for realistic SPML evaluation and motivates further work on semi-supervised signals and problem-specific priors for deployment in ecological and acoustic recognition tasks. Overall, L48 serves as a realistic testbed for SPML methods in real-world, fine-grained multi-label problems.
Abstract
In the single-positive multi-label (SPML) setting, each image in a dataset is labeled with the presence of a single class, while the true presence of other classes remains unknown. The challenge is to narrow the performance gap between this partially labeled setting and fully supervised learning, which often requires a significant annotation budget. Prior SPML methods were developed and benchmarked on synthetic datasets created by randomly sampling single positive labels from fully annotated datasets like Pascal VOC, COCO, NUS-WIDE, and CUB200. However, this synthetic approach does not reflect real-world scenarios and fails to capture the fine-grained complexities that can lead to difficult misclassifications. In this work, we introduce the L48 dataset, a fine-grained, real-world multi-label dataset derived from recordings of bird sounds. L48 provides a natural SPML setting with single-positive annotations on a challenging, fine-grained domain, as well as two extended settings in which domain priors give access to additional negative labels. We benchmark existing SPML methods on L48, observe significant performance differences compared to synthetic datasets, and analyze method weaknesses, underscoring the need for more realistic and difficult benchmarks.
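To make the synthetic benchmark construction concrete, the following is a minimal sketch of how single positive labels might be sampled from a fully annotated label matrix. It is an illustration under stated assumptions, not the procedure used by any particular benchmark: the function name `sample_single_positive`, the toy label matrix, and the convention of encoding unobserved labels as 0 are all hypothetical choices made here.

```python
import numpy as np


def sample_single_positive(full_labels, rng):
    """Keep one randomly chosen positive label per example.

    full_labels: (N, C) binary matrix of complete annotations.
    Returns an (N, C) matrix in which 1 marks the single retained
    positive and 0 marks "unobserved" (NOT a confirmed negative).
    Assumption: rows with no positives stay all-zero.
    """
    full_labels = np.asarray(full_labels)
    observed = np.zeros_like(full_labels)
    for i, row in enumerate(full_labels):
        positives = np.flatnonzero(row)  # indices of true positive classes
        if positives.size == 0:
            continue
        keep = rng.choice(positives)  # uniform choice among true positives
        observed[i, keep] = 1
    return observed


# Toy fully annotated labels: 3 examples, 4 classes (hypothetical data).
full = np.array([
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
])
rng = np.random.default_rng(0)
observed = sample_single_positive(full, rng)
print(observed)
```

Because the retained label is drawn uniformly from the true positives, this kind of sampling carries no annotator bias, which is one way the synthetic setting can differ from single-positive labels collected in practice.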
