Generative AI-based data augmentation for improved bioacoustic classification in noisy environments

Anthony Gibbons, Emma King, Ian Donohue, Andrew Parnell

TL;DR

This work tackles data scarcity in bioacoustic species classification by introducing generative AI-based spectrogram augmentation using ACGAN and diffusion models. It demonstrates that Stable Diffusion, used in a latent-diffusion framework, produces high-quality, diverse spectrograms that, when added to real data, improve classifier performance in noisy wind-farm environments. Although synthetic augmentation yields notable gains, especially on validation data, it does not surpass a strong BirdNET baseline on human-labelled test data, highlighting ongoing challenges with pseudo-label bias and data representativeness. The study provides practical insights and resources for incorporating synthetic data into ecoacoustic pipelines and establishes a foundation for broader applications across taxa and habitats.

Abstract

Obtaining data to train robust artificial intelligence (AI)-based models for species classification can be challenging, particularly for rare species. Data augmentation can boost classification accuracy by increasing the diversity of training data, and augmented data are cheaper to obtain than expert-labelled data. However, many classic image-based augmentation techniques are not suitable for audio spectrograms. We investigate two generative AI models as data augmentation tools to synthesise spectrograms and supplement audio data: Auxiliary Classifier Generative Adversarial Networks (ACGAN) and Denoising Diffusion Probabilistic Models (DDPMs). The latter performed particularly well in terms of both realism of generated spectrograms and accuracy in a resulting classification task. Alongside these new approaches, we present a new audio data set of 640 hours of bird calls from wind farm sites in Ireland, approximately 800 samples of which have been labelled by experts. Wind farm data are particularly challenging for classification models given the background wind and turbine noise. An ensemble of classification models trained on combined real and synthetic data compared well with highly confident BirdNET predictions. Each classifier we used was improved by including synthetic data, and classification metrics generally improved in line with the amount of synthetic data added. Our approach can be used to augment acoustic signals for more species and other land-use types, and has the potential to bring about advances in our capacity to develop reliable AI-based detection of rare species. Our code is available at https://github.com/gibbona1/SpectrogramGenAI.

Paper Structure

This paper contains 13 sections, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Auxiliary Classifier GAN (ACGAN) architecture. For training, real images $x$ and their (one-hot encoded) labels $y$ are given, and sample noise vectors and $y$ are concatenated into latent vectors $z$. The Generator $G$ generates images $G(z)$ which come from the same distribution as $x$. Both $x$ and $G(z)$ are passed to a Discriminator $D$ which predicts 1) whether images are real or fake, and 2) their classes. A two-player minimax game arises, where $G$ learns to create images to fool $D$ and $D$ learns to better tell $x$ and $G(z)$ apart even as $G$ improves. Images are then generated using the generator as above for subsequent sampling.
  • Figure 2: Denoising UNet from diffusion model. We have a defined forward process $q(x_t\mid x_{t-1})$ to add noise to an image $x_{t-1}$ to create an image $x_t$. The reverse process $p_\theta(x_{t-1}\mid x_t)$ is performed by the UNet with parameters $\theta$.
  • Figure 3: Conditional Latent Diffusion architecture. For training, an image $x$ is encoded to a latent $z$ using the VQAE encoder. A random timepoint $t$ from $1,\dots,T$ is taken. Noise is added to $z$ for $t$ steps, giving $z_t$. The UNet, with inputs $z_t$, $t$ and $y$, outputs predicted noise, and the loss compares the predicted noise with the actual noise added to $z$. For sampling, a class $y$ is given and a latent $z$ is initialised as either pure noise or a partially noised image. For $T$ steps, the UNet runs with $z$, $t=1$ and $y$ as inputs and its output is subtracted from $z$, updating $z$. Once denoised $T$ times, this $\tilde{z}$ is vector quantised, decoded to $\tilde{x}$ and saved.
  • Figure 4: Real and Synthetic Spectrograms. The first two (leftmost) red-outlined images in each row are real spectrograms. The next three orange-outlined images are generated by the ACGAN, while the last three blue-outlined images are synthetic samples generated using the DDPM. The $x$-axis of each spectrogram represents time, ranging from 0 to 3 seconds. The $y$-axis represents frequency, ranging from 0 to 12 kHz on the mel scale. Brighter colours (using the viridis colour palette) indicate higher energy or loudness. The ACGAN-generated Dunnock and Hooded Crow samples do show some species-specific features, but the Great Tit samples are poor imitations. The DDPM-generated samples closely resemble the real examples in red while still showing variety. The Great Tit examples generated by DDPM are clearly more convincing than the ACGAN samples.
  • Figure 5: Classification Accuracy Results. Top-1 (red line) and Top-5 (blue line) validation accuracy (y-axis) against the number of synthetic samples added per class (x-axis). For all models, accuracy tends to improve when synthetic examples are included, particularly for VGG. While the ResNet model's accuracy decreases beyond 50 additional synthetic samples per class, these values are still better than with only real data.
  • ...and 2 more figures
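
The forward process and training loss described in the captions for Figures 2 and 3 can be illustrated with a minimal NumPy sketch. It is not taken from the authors' repository. It relies on the standard DDPM result that applying $q(x_t\mid x_{t-1})$ for $t$ steps is equivalent to sampling $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ directly. The linear noise schedule, its endpoint values, the array shapes, and all function names are assumptions chosen for illustration, and the UNet is left out entirely: only the target it would be trained to predict is shown.

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear schedule (assumed values)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def forward_noise(z0, t, alpha_bar, rng):
    """Sample z_t from q(z_t | z_0) in closed form; t is 1-indexed."""
    eps = rng.standard_normal(z0.shape)  # the "actual noise" added to z
    a = alpha_bar[t - 1]
    z_t = np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps
    return z_t, eps

def noise_prediction_loss(eps_pred, eps):
    """Mean squared error between predicted and actual noise."""
    return float(np.mean((eps_pred - eps) ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()

# Stand-in for a batch of encoded latents z (hypothetical shape).
z0 = rng.standard_normal((4, 8, 8))

# Draw a random timepoint t from 1..T and noise the latents.
t = int(rng.integers(1, len(alpha_bar) + 1))
z_t, eps = forward_noise(z0, t, alpha_bar, rng)

# A perfect noise predictor would achieve zero loss.
loss_perfect = noise_prediction_loss(eps, eps)
```

In a full training loop, `eps_pred` would come from a network that receives `z_t`, `t`, and the class label `y`. Here the loss function is exercised only with the true noise to show what the objective compares.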