InsectSet459: an open dataset of insect sounds for bioacoustic machine learning
Marius Faiß, Burooj Ghani, Dan Stowell
TL;DR
InsectSet459 addresses the need for scalable insect sound datasets to support deep-learning-based monitoring by assembling a large, open collection of 26,399 recordings from 459 species of Orthoptera and Cicadidae. The dataset is multi-source and license-friendly, and it preserves high-frequency information by avoiding artificial down-sampling. It uses a 60/20/20 train/validation/test split, and clips are truncated to 2 minutes to maximize diversity. Baseline results using InsectEffNet (CNN) and PaSST (Transformer) yield macro-F1 scores of around 56–58% on InsectSet459, with strong performance for common species but notable drops for rare ones, underscoring the challenges of data scarcity and spectral coverage. The work highlights the potential of open, multi-rate insect sound data to drive methodological advances, including multi-rate representations and self-supervised learning, and so to enable scalable, automated insect monitoring in real-world environments.
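The truncation and split described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names `truncate` and `split_by_species` are hypothetical, and splitting within each species is an assumption, since the summary states only the overall 60/20/20 ratio and the 2-minute cap.

```python
import random
from collections import defaultdict

MAX_SECONDS = 120  # 2-minute cap stated in the summary


def truncate(num_samples, sample_rate, max_seconds=MAX_SECONDS):
    """Return how many samples remain after capping a clip at max_seconds."""
    return min(num_samples, max_seconds * sample_rate)


def split_by_species(files, seed=0, fractions=(0.6, 0.2, 0.2)):
    """Split (file_id, species) pairs into train/val/test lists.

    Assumption: the ratio is applied within each species so that every
    species appears in each subset where possible.
    """
    rng = random.Random(seed)
    by_species = defaultdict(list)
    for file_id, species in files:
        by_species[species].append(file_id)

    splits = {"train": [], "val": [], "test": []}
    for _species, ids in sorted(by_species.items()):
        rng.shuffle(ids)
        n_train = round(fractions[0] * len(ids))
        n_val = round(fractions[1] * len(ids))
        splits["train"] += ids[:n_train]
        splits["val"] += ids[n_train:n_train + n_val]
        splits["test"] += ids[n_train + n_val:]  # remainder goes to test
    return splits
```

Because the cap is expressed in seconds, the number of samples kept depends on each file's own sample rate, which is consistent with leaving recordings at their native rates rather than resampling them.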
Abstract
Automatic recognition of insect sound could help us understand changing biodiversity trends around the world, but insect sounds are challenging to recognize even for deep learning. We present a new dataset comprising 26,399 audio files from 459 species of Orthoptera and Cicadidae. It is the first large-scale dataset of insect sound that is readily usable for developing novel deep-learning methods. Its recordings were made with a variety of audio recorders at varying sample rates, in order to capture the extremely broad range of frequencies that insects produce. We benchmark performance with two state-of-the-art deep learning classifiers, demonstrating good performance but also significant room for improvement in acoustic insect classification. This dataset can serve as a realistic test case for implementing insect monitoring workflows, and as a challenging basis for developing audio representation methods that can handle highly variable frequencies and/or sample rates.
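To illustrate why sample rate matters here, the sketch below applies the Nyquist criterion: a recording can only represent frequencies below half its sample rate. The helper names and the 30 kHz example frequency are illustrative assumptions, not values taken from the dataset.

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency representable at the given sample rate."""
    return sample_rate_hz / 2.0


def is_representable(freq_hz, sample_rate_hz):
    """True if a tone at freq_hz lies below the Nyquist frequency."""
    return freq_hz < nyquist_hz(sample_rate_hz)


# Hypothetical example: a call component at 30 kHz (illustrative value).
call_hz = 30_000
for rate in (16_000, 44_100, 96_000):
    print(rate, nyquist_hz(rate), is_representable(call_hz, rate))
```

Under these assumptions, a 30 kHz component survives only in the 96 kHz recording, so resampling every file to a single low rate would discard it.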
