Merlin L48 Spectrogram Dataset

Aaron Sun, Subhransu Maji, Grant Van Horn

TL;DR

The Merlin L48 spectrogram dataset targets the realistic single-positive multi-label (SPML) learning setting in a fine-grained acoustic domain, in contrast to traditional SPML benchmarks, which simulate label scarcity by discarding labels from fully annotated datasets. The authors benchmark multiple SPML losses (e.g., BCE-AN, LS, EM, ROLE, LL variants) on L48 and introduce asset-level consistency regularization ${\cal R}_P$ and the incorporation of negative label priors via ${\cal L}_{SPML}^-$. They show that real-world SPML on L48 is harder than synthetic COCO-based benchmarks: LS and EM are the most robust baselines, while LL variants falter due to fine-grained misclassifications. Asset regularization consistently improves performance across methods, and geo/checklist priors provide additional gains, though not enough to match full supervision. The dataset, paired with a comprehensive benchmark and open-source tooling, highlights the need for realistic SPML evaluation and motivates further work on semi-supervised signals and problem-specific priors for deployment in ecological and acoustic recognition tasks. Overall, L48 serves as a realistic testbed for SPML methods and reveals opportunities to improve robustness and leverage domain priors in real-world, fine-grained multi-label problems.

Abstract

In the single-positive multi-label (SPML) setting, each image in a dataset is labeled with the presence of a single class, while the true presence of other classes remains unknown. The challenge is to narrow the performance gap between this partially-labeled setting and fully-supervised learning, which often requires a significant annotation budget. Prior SPML methods were developed and benchmarked on synthetic datasets created by randomly sampling single positive labels from fully-annotated datasets like Pascal VOC, COCO, NUS-WIDE, and CUB200. However, this synthetic approach does not reflect real-world scenarios and fails to capture the fine-grained complexities that can lead to difficult misclassifications. In this work, we introduce the L48 dataset, a fine-grained, real-world multi-label dataset derived from recordings of bird sounds. L48 provides a natural SPML setting with single-positive annotations on a challenging, fine-grained domain, as well as two extended settings in which domain priors give access to additional negative labels. We benchmark existing SPML methods on L48 and observe significant performance differences compared to synthetic datasets and analyze method weaknesses, underscoring the need for more realistic and difficult benchmarks.
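To make the synthetic setup described in the abstract concrete, here is a minimal sketch of how a fully annotated label matrix can be reduced to single-positive observations by random sampling. The function names (`sample_single_positive`, `assume_negative`) and the toy label matrix are hypothetical and do not come from the paper or its tooling. The second helper reflects the usual "assume negative" reading of the BCE-AN baseline, which this section names but does not define, so treat that interpretation as an assumption.

```python
import random


def sample_single_positive(full_labels, rng):
    """Reduce fully annotated rows to single-positive observations.

    full_labels: list of rows, each a list of 0/1 ints (1 = class present).
    Returns rows holding 1 for the single kept positive and None for
    every label whose status is now unknown.
    """
    observed = []
    for row in full_labels:
        positives = [j for j, y in enumerate(row) if y == 1]
        obs = [None] * len(row)
        if positives:  # a row with no positives stays fully unknown
            obs[rng.choice(positives)] = 1
        observed.append(obs)
    return observed


def assume_negative(observed):
    """Assumed BCE-AN-style targets: every unobserved label becomes 0."""
    return [[1 if y == 1 else 0 for y in row] for row in observed]


# Toy fully annotated data: 3 examples, 4 classes (illustrative only).
full = [[1, 0, 1, 1], [0, 1, 0, 0], [1, 1, 0, 0]]
obs = sample_single_positive(full, random.Random(0))
targets = assume_negative(obs)

# Count true positives that the assumed-negative targets mark as absent.
false_negatives = sum(
    f - t for f_row, t_row in zip(full, targets) for f, t in zip(f_row, t_row)
)
print(obs)
print("false negatives introduced:", false_negatives)
```

In this toy case the sampling step hides three of the six true positives, and treating them as negatives turns each one into a label error. That is the gap between single-positive and fully supervised training that SPML methods try to narrow. On L48 the single positive is not sampled at random but comes from the recording's annotation, which is part of what distinguishes it from such synthetic benchmarks.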

Paper Structure

This paper contains 30 sections, 1 equation, 10 figures, 13 tables.

Figures (10)

  • Figure 1: The Merlin L48 Spectrogram (L48) dataset spans the Lower 48 states of the US with bird recordings throughout the year. Each recording is associated with a target species (solid) but also contains background species (dashed), giving rise to a natural single-positive, multi-label (SPML) task. L48 stands out among similar datasets as being at country-wide, year-round scale while still maintaining high-quality bounding box annotations (see Table \ref{tab:dataset-comp}a).
  • Figure 2: An illustration of how time and frequency overlaps can cause distortions in the resulting spectrogram. Images are underlined with corresponding box colors. Left, an image with 8 different species vocalizing (from left to right: Mourning Dove, Blue Jay, Yellow-rumped Warbler, Chipping Sparrow, Tufted Titmouse, Brown-headed Cowbird, American Robin, Northern Cardinal). Right, the vocalizations of Black-throated Green Warbler (green) and Chipping Sparrow (red) are depicted and the two birds are shown vocalizing simultaneously.
  • Figure 3: Examples of difficult vocalizations in the L48. The bottom right shows a Northern Mockingbird imitating other birds in its long song, while the others show confusing species pairs. From left to right, the top row shows Red-eyed Vireo, Philadelphia Vireo, Chipping Sparrow, and Pine Warbler songs, while the bottom row shows Yellow-bellied Sapsucker and Red-breasted Sapsucker.
  • Figure 4: Overview of results. Left: L48 (leftmost) and COCO (middle) mAP performance distributions across five trials of each SPML method for four different data regimes, shown as box plots and lines. For each box, the thin lines show the 1.5x interquartile range, the box shows the interquartile range, and the horizontal line shows the median. In parentheses, the proportion or number of annotated labels is given, with + signifying positive labels and - signifying negative labels. The mean performance of three methods is plotted: BCE-AN, LS, and LL-R. For L48 we show the target-only performance with asset regularization. Right: L48 (left) and COCO (right) class-averaged precision-recall curves for BCE-Full and BCE-AN.
  • Figure 5: In-depth method performance analysis. (a-b): Per-class average precision on BCE-AN compared to BCE-Full for both datasets, where dot size is proportional to class frequency in the test set. Classes which perform worse in SPML training fall below the diagonal line. Method names are given in axes, with mAP in parentheses as percentages. (c-d): Precision-recall curves for various methods on various data regimes of the L48.
  • ...and 5 more figures