AudioProtoPNet: An interpretable deep learning model for bird sound classification

René Heinrich, Lukas Rauch, Bernhard Sick, Christoph Scholz

TL;DR

AudioProtoPNet, an adaptation of the Prototypical Part Network for multi-label bird sound classification, is an inherently interpretable model that uses a ConvNeXt backbone to extract embeddings, with the classification layer replaced by a prototype learning classifier trained on these embeddings.

Abstract

Deep learning models have significantly advanced acoustic bird monitoring by being able to recognize numerous bird species based on their vocalizations. However, traditional deep learning models are black boxes that provide no insight into their underlying computations, limiting their usefulness to ornithologists and machine learning engineers. Explainable models could facilitate debugging, knowledge discovery, trust, and interdisciplinary collaboration. This study introduces AudioProtoPNet, an adaptation of the Prototypical Part Network (ProtoPNet) for multi-label bird sound classification. It is an inherently interpretable model that uses a ConvNeXt backbone to extract embeddings, with the classification layer replaced by a prototype learning classifier trained on these embeddings. The classifier learns prototypical patterns of each bird species' vocalizations from spectrograms of training instances. During inference, audio recordings are classified by comparing them to the learned prototypes in the embedding space, providing explanations for the model's decisions and insights into the most informative embeddings of each bird species. The model was trained on the BirdSet training dataset, which consists of 9,734 bird species and over 6,800 hours of recordings. Its performance was evaluated on the seven test datasets of BirdSet, covering different geographical regions. AudioProtoPNet outperformed the state-of-the-art model Perch, achieving an average AUROC of 0.90 and a cmAP of 0.42, with relative improvements of 7.1% and 16.7% over Perch, respectively. These results demonstrate that even for the challenging task of multi-label bird sound classification, it is possible to develop powerful yet inherently interpretable deep learning models that provide valuable insights for ornithologists and machine learning engineers.
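The relative improvements quoted in the abstract can be turned back into approximate baseline scores for Perch. The short sketch below does that arithmetic; the helper name `implied_baseline` is hypothetical, and because the abstract reports rounded values, the results are only an approximation of the Perch scores reported in the paper itself.

```python
def implied_baseline(score: float, relative_improvement: float) -> float:
    """Baseline score b such that score = b * (1 + relative_improvement)."""
    return score / (1.0 + relative_improvement)


# Values taken from the abstract: AUROC 0.90 (+7.1%), cmAP 0.42 (+16.7%).
perch_auroc = implied_baseline(0.90, 0.071)
perch_cmap = implied_baseline(0.42, 0.167)

# Rounded to two decimals, matching the precision used in the abstract.
print(round(perch_auroc, 2), round(perch_cmap, 2))
```

Under these assumptions, the implied Perch averages are roughly 0.84 AUROC and 0.36 cmAP.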

Paper Structure

This paper contains 19 sections, 6 equations, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: Illustration of the importance of inherently interpretable models such as AudioProtoPNet for different audiences of the bird sound classification task.
  • Figure 2: Illustration of the AudioProtoPNet inference process. The raw audio waveforms are first converted to spectrograms and then mapped to the embedding space via a CNN backbone. There, the similarities between the spectrogram embeddings and the learned prototypes are computed. While the similarities serve as local explanations for the model's classification, the learned prototypes themselves represent global explanations. Finally, the confidence score for each class is computed by weighting the similarities to the different prototypes of the respective class with a linear layer and then applying a sigmoid activation function.
  • Figure 3: Average performance of AudioProtoPNet-5 (solid), ConvNeXt (dashed), and Perch (dotted) on the seven test datasets over five different random seeds, shown in a radar plot. With respect to the AUROC metric, AudioProtoPNet-5 outperformed ConvNeXt and Perch on all test datasets. AudioProtoPNet-5 also achieved the best cmAP and top-1 accuracy values for most of the test datasets. Perch achieved only a slightly better cmAP value for the NES test dataset, and a much higher top-1 accuracy for the NES and UHH datasets.
  • Figure 4: The spectrograms of the most similar instances from the SNE subset of the training dataset, with their labels highlighted in bold, for the five prototypes learned by AudioProtoPNet-5 for the Mountain Chickadee (mouchi). The prototypes correspond to the parts of the spectrograms surrounded by the orange bounding boxes. Additionally, the similarity values to the prototypes $s^{(c,j)}$ and the weights of the respective prototypes $w^{(j,c)}$ in the final layer are shown.
  • Figure 5: The spectrograms of the most similar instances from the SNE subset of the training dataset, with their labels highlighted in bold, for the five prototypes learned by AudioProtoPNet-5 for the Yellow-rumped Warbler (yerwar). The prototypes correspond to the parts of the spectrograms surrounded by the orange bounding boxes. Additionally, the similarity values to the prototypes $s^{(c,j)}$ and the weights of the respective prototypes $w^{(j,c)}$ in the final layer are shown. Prototype 1 also shows a high similarity to instances of the bird species Wilson's Warbler (wlswar) and Brown Creeper (brncre).
  • ...and 2 more figures
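
The inference steps described in the Figure 2 caption (embedding, prototype similarity, linear weighting, sigmoid) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `prototype_scores`, the array shapes, the use of cosine similarity, and the max over spatial positions are all assumptions made for the sketch, and the backbone output is replaced by random numbers.

```python
import numpy as np


def prototype_scores(embedding_map, prototypes, weights):
    """Sketch of a prototype-based multi-label head (assumed design).

    embedding_map: (H, W, D) backbone output for one spectrogram
    prototypes:    (C, J, D) J prototypes for each of C classes
    weights:       (C, J) final-layer weight of each prototype for its class
    Returns per-prototype similarities (C, J) and class confidences (C,).
    """
    h, w, d = embedding_map.shape
    patches = embedding_map.reshape(h * w, d)

    # Assumption: cosine similarity between prototypes and embedding patches.
    eps = 1e-12
    patches = patches / (np.linalg.norm(patches, axis=1, keepdims=True) + eps)
    protos = prototypes / (np.linalg.norm(prototypes, axis=2, keepdims=True) + eps)
    sim_map = np.einsum("cjd,pd->cjp", protos, patches)  # (C, J, H*W)

    # Assumption: keep the best-matching spectrogram region per prototype.
    similarities = sim_map.max(axis=2)  # (C, J)

    # Weight each class's own prototype similarities, then apply a sigmoid
    # so every class gets an independent (multi-label) confidence score.
    logits = (weights * similarities).sum(axis=1)  # (C,)
    confidences = 1.0 / (1.0 + np.exp(-logits))
    return similarities, confidences


# Toy usage with random stand-ins for the backbone output and prototypes.
rng = np.random.default_rng(0)
embedding_map = rng.normal(size=(4, 6, 8))  # H=4, W=6, D=8
prototypes = rng.normal(size=(3, 5, 8))     # 3 classes, 5 prototypes each
weights = np.ones((3, 5))
similarities, confidences = prototype_scores(embedding_map, prototypes, weights)
print(similarities.shape, confidences.shape)
```

In this sketch, the `similarities` array plays the role of the local explanation for one recording, while the rows of `prototypes` correspond to the global, per-class patterns the caption refers to.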