InsectSet459: an open dataset of insect sounds for bioacoustic machine learning

Marius Faiß, Burooj Ghani, Dan Stowell

TL;DR

InsectSet459 (IS459) addresses the need for scalable insect sound datasets to support deep-learning-based monitoring by assembling a large, open collection of 26,399 recordings from 459 species across Orthoptera and Cicadidae. The dataset is multi-source and openly licensed, and it preserves high-frequency information by avoiding artificial down-sampling. It uses a 60/20/20 train/validation/test split, and recordings are truncated to 2 minutes to maximize diversity. Baseline results using InsectEffNet (CNN) and PaSST (Transformer) yield macro-F1 scores around 56–58% on IS459, with strong performance for common species but notable drops for rare ones, underscoring the challenges of data scarcity and spectral coverage. The work highlights the potential of open, multi-rate insect sound data to drive methodological advances, including multi-rate representations and self-supervised learning, toward scalable, automated insect monitoring in real-world environments.

Abstract

Automatic recognition of insect sound could help us understand changing biodiversity trends around the world, but insect sounds are challenging to recognize even for deep learning. We present a new dataset comprising 26,399 audio files from 459 species of Orthoptera and Cicadidae. It is the first large-scale dataset of insect sound that is easily applicable for developing novel deep-learning methods. Its recordings were made with a variety of audio recorders using varying sample rates to capture the extremely broad range of frequencies that insects produce. We benchmark performance with two state-of-the-art deep learning classifiers, demonstrating good performance but also significant room for improvement in acoustic insect classification. This dataset can serve as a realistic test case for implementing insect monitoring workflows, and as a challenging basis for the development of audio representation methods that can handle highly variable frequencies and/or sample rates.

Paper Structure

This paper contains 13 sections, 1 equation, 6 figures, 3 tables.

Figures (6)

  • Figure 3: The geographic locations of the files contained in InsectSet459. Of the 26,399 files in the dataset, 305 have no coordinates associated with them. (a) Upper plot: Recording locations split by source dataset. Blue dots indicate recordings sourced from iNaturalist, red dots show recordings from xeno-canto. Purple shows the overlap of both datasets. (b) Lower plot: Recording locations split into the training, validation and test sets. Blue dots indicate recordings in the training set, red dots show recordings in the test set and green dots show recordings in the validation set. Other colors show the overlap between the subsets.
  • Figure 4: (a) Left plot: The number of files for all 459 species in the dataset, sorted in descending rank order and scaled logarithmically. This illustrates the strongly imbalanced nature of the data, a common aspect of bioacoustic datasets. (b) Right plot: The distribution of file durations in the dataset. Most files are around 10 seconds long. The large peak at 120 seconds is the result of trimming longer files to a maximum of two minutes.
  • Figure 5: Summary of the taxonomic groups represented within InsectSet459. The hierarchically-nested boxes represent taxonomic groups (some intermediate ranks such as subfamilies have been omitted for simplicity). For each box, the size indicates the number of species represented in the dataset, and the darkness of the box indicates the total number of sound recordings included.
  • Figure 6: Per-species classification performance of classifiers trained on IS459, evaluated on the test set. The species are ordered on the x-axis from most common to least common in the dataset, and for each one we plot a running mean of the F1 score, meaning the mean of the F1 score for all species that are equally or more common.
  • Figure 7: Per-species classification performance of InsectEffNet trained on InsectSet66, evaluated on the test set. The species are ordered on the x-axis from most common to least common in the dataset, and for each one we plot a running mean of the F1 score, meaning the mean of the F1 score for all species that are equally or more common.
  • ...and 1 more figure
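
The running mean of the F1 score described in the captions for Figures 6 and 7 can be illustrated with a short sketch. This is not the authors' evaluation code: the function names `per_class_f1` and `running_mean_f1`, the toy labels, and the alphabetical tie-breaking between equally common species are all assumptions made for illustration.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} computed from paired true/predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    f1 = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1[label] = 2 * tp[label] / denom if denom else 0.0
    return f1


def running_mean_f1(f1_by_species, file_counts):
    """Order species from most to least common and return a list of
    (species, mean F1 over that species and all more common ones).

    Assumption: ties in file count are broken alphabetically, so tied
    species are not pooled into a single point.
    """
    order = sorted(f1_by_species, key=lambda s: (-file_counts.get(s, 0), s))
    curve, total = [], 0.0
    for rank, species in enumerate(order, start=1):
        total += f1_by_species[species]
        curve.append((species, total / rank))
    return curve


# Toy example with three hypothetical species labels.
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "b", "a"]
f1 = per_class_f1(y_true, y_pred)
curve = running_mean_f1(f1, Counter(y_true))
```

The last point of such a curve averages over every species with equal weight, so it coincides with the macro-F1 score; earlier points show how performance looks when only the more common species are considered.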