Biodenoising: Animal Vocalization Denoising without Access to Clean Data
Marius Miron, Sara Keen, Jen-Yu Liu, Benjamin Hoffman, Masato Hagiwara, Olivier Pietquin, Felix Effenberger, Maddie Cusimano
TL;DR
This work tackles the denoising of animal vocalizations without access to clean data: speech-enhancement models produce pseudo-clean targets, and a denoiser is then retrained on synthetic mixtures that combine noise with those targets. It introduces two non-overlapping datasets, a large and diverse training set and a carefully curated benchmark set, spanning multiple taxa and environments. It demonstrates that models trained with pseudo-clean targets (obtained via noisereduce or Demucs) achieve substantially better denoising performance than traditional noisy-target training, with robust generalization across underwater and terrestrial noise. A key finding is that the priors of speech models used for pre-denoising transfer to bioacoustics, offering a practical path to denoising tools for the many species that lack clean recordings. The approach enables more reliable bioacoustic analyses and playback experiments, with potential for broader adoption in wildlife monitoring and conservation research, especially as the method scales to higher sample rates in future work.
Abstract
Animal vocalization denoising is a task similar to human speech enhancement, which is relatively well-studied. In contrast to speech enhancement, it involves a higher diversity of sound production mechanisms and recording environments, and this diversity is a challenge for existing models. Adding to the challenge, and again in contrast to speech, we lack large and diverse datasets of clean vocalizations. As a solution, we use as training data pseudo-clean targets, i.e. pre-denoised vocalizations, together with segments of background noise that contain no vocalization. We propose a training set derived from bioacoustics datasets and repositories representing diverse species, acoustic environments, and geographic regions. Additionally, we introduce a non-overlapping benchmark set comprising clean vocalizations from different taxa and noise samples. We show that denoising models (Demucs, CleanUNet) trained on pseudo-clean targets obtained with speech enhancement models achieve competitive results on the benchmark set. We publish data, code, libraries, and demos at https://earthspecies.github.io/biodenoising/.
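To make the training-data idea concrete, the following is a minimal sketch of how one synthetic training pair could be built from a pseudo-clean target and a noise-only segment. It is not the authors' pipeline: the function name `mix_at_snr`, the 5 dB signal-to-noise ratio, the 16 kHz sample rate, and the sine-wave and white-noise stand-ins are all assumptions chosen for illustration, and the abstract does not state how mixtures are actually scaled.

```python
import numpy as np


def mix_at_snr(pseudo_clean, noise, snr_db):
    """Scale `noise` so that pseudo_clean/noise power equals `snr_db`, then add.

    Hypothetical helper for illustration; returns (mixture, scaled_noise).
    """
    pseudo_clean = np.asarray(pseudo_clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    if pseudo_clean.shape != noise.shape:
        raise ValueError("pseudo_clean and noise must have the same shape")

    target_power = np.mean(pseudo_clean ** 2)
    noise_power = np.mean(noise ** 2)
    if target_power == 0 or noise_power == 0:
        raise ValueError("both signals must have non-zero power")

    # Gain g such that 10*log10(target_power / (g^2 * noise_power)) == snr_db.
    gain = np.sqrt(target_power / (noise_power * 10 ** (snr_db / 10)))
    scaled_noise = gain * noise
    return pseudo_clean + scaled_noise, scaled_noise


# Stand-ins (assumptions): a sine tone plays the role of a pre-denoised
# vocalization, white noise plays the role of a vocalization-free segment.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
pseudo_clean = 0.5 * np.sin(2 * np.pi * 440 * t)
noise = np.random.default_rng(0).normal(size=sample_rate)

# (mixture, pseudo_clean) is one input/target pair for a denoising model.
mixture, scaled_noise = mix_at_snr(pseudo_clean, noise, snr_db=5.0)
achieved_snr_db = 10 * np.log10(
    np.mean(pseudo_clean ** 2) / np.mean(scaled_noise ** 2)
)
```

In this sketch the model would receive `mixture` as input and `pseudo_clean` as the regression target, so any residual noise left in the pseudo-clean target is treated as signal; that limitation is inherent to training without clean data.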
