Learning to rumble: Automated elephant call classification, detection and endpointing using deep architectures

Christiaan M. Geldenhuys, Thomas R. Niesler

TL;DR

This work develops deep-learning methods for detecting, endpointing, and classifying elephant vocalisations in continuous audio. Detection is performed at the frame level, which yields call endpoints implicitly and makes subcall analysis possible. The authors configure an audio spectrogram transformer (AST) in a sequence-to-sequence manner and show that transfer learning by pre-training brings further improvements in both performance and computational cost, using one African and one Asian elephant dataset. Their best models achieve an AP of 0.962 for framewise detection, an AUC of 0.957 for 5-class call classification, and an AUC of 0.979 for 7-class subcall classification, the last of which is a new benchmark. The results suggest that transformer-based models, particularly when pre-trained, are a practical basis for automated elephant monitoring in conservation and wildlife management.

Abstract

We consider the problem of detecting, isolating and classifying elephant calls in continuously recorded audio. Such automatic call characterisation can assist conservation efforts and inform environmental management strategies. In contrast to previous work in which call detection was performed at a segment level, we perform call detection at a frame level which implicitly also allows call endpointing, the isolation of a call in a longer recording. For experimentation, we employ two annotated datasets, one containing Asian and the other African elephant vocalisations. We evaluate several shallow and deep classifier models, and show that the current best performance can be improved by using an audio spectrogram transformer (AST), a neural architecture which has not been used for this purpose before, and which we have configured in a novel sequence-to-sequence manner. We also show that using transfer learning by pre-training leads to further improvements both in terms of computational complexity and performance. Finally, we consider sub-call classification using an accepted taxonomy of call types, a task which has not previously been considered. We show that also in this case the transformer architectures provide the best performance. Our best classifiers achieve an average precision (AP) of 0.962 for framewise binary call classification, and an area under the receiver operating characteristic (AUC) of 0.957 and 0.979 for call classification with 5 classes and sub-call classification with 7 classes respectively. All of these represent either new benchmarks (sub-call classifications) or improvements on previously best systems. We conclude that a fully-automated elephant call detection and subcall classification system is within reach. Such a system would provide valuable information on the behaviour and state of elephant herds for the purposes of conservation and management.
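The link between frame-level detection and endpointing can be made concrete with a small sketch. It is not the authors' implementation: the function name `endpoint_calls`, the fixed decision threshold, and the `hop_seconds` frame spacing are all assumptions chosen for illustration. The idea is only that once every frame has a detection score, each contiguous run of positive frames gives one call with a start and an end time.

```python
from typing import List, Sequence, Tuple


def endpoint_calls(
    frame_scores: Sequence[float],
    threshold: float = 0.5,    # assumed decision threshold, not from the paper
    hop_seconds: float = 0.01, # assumed spacing between frames, not from the paper
) -> List[Tuple[float, float]]:
    """Turn per-frame detection scores into (start, end) times in seconds."""
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the currently open call, if any
    for i, score in enumerate(frame_scores):
        active = score >= threshold
        if active and start is None:
            start = i  # a call begins at this frame
        elif not active and start is not None:
            # the call ended at the previous frame; frame i is the exclusive end
            segments.append((start * hop_seconds, i * hop_seconds))
            start = None
    if start is not None:
        # the recording ended while a call was still active
        segments.append((start * hop_seconds, len(frame_scores) * hop_seconds))
    return segments


scores = [0.1, 0.2, 0.9, 0.8, 0.7, 0.3, 0.1, 0.6, 0.9]
print(endpoint_calls(scores, threshold=0.5, hop_seconds=1.0))
# [(2.0, 5.0), (7.0, 9.0)]
```

A segment-level detector would only report that a call occurs somewhere in a fixed-length segment, whereas the per-frame scores above localise each call to within one frame.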

Paper Structure

This paper contains 60 sections, 5 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Number of occurrences of each call type in the elev and the ldc corpora. Some call types are omitted from experiments because they occur too rarely to allow cross-validation. Each included call type is annotated with its class prevalence within its respective dataset.
  • Figure 2: Number of occurrences of each subcall type present in the elev corpus. Some subcall types are omitted from experiments because they occur too rarely to allow cross-validation.
  • Figure 3: Mel-spectrogram feature representation of an elephant call used to illustrate (left) call detection and (right) call classification. On the left, $\hat{\mathbf{y}}^{(i)}_{d}$ denotes the classifier output for the $i$-th frame in a sequence of $N$ frames extracted from one recording. The shaded area indicates the frames for which a positive detection decision was made. On the right, $\hat{\mathbf{y}}_{c}$ denotes the multi-label classifier output for a single elephant call, already endpointed (shaded).
  • Figure 4: Illustration of the two strategies followed for call detection, described in the section on call event detection. On the left, $\hat{y}_{d}^{(i)}$ denotes the classifier output for the $i$-th (centre) input frame, given the input context window $\mathbf{X}^{(i)}$ consisting of $w$ consecutive spectral features $X^{(i-\frac{w}{2})} \ldots X^{(i+\frac{w}{2})}$. On the right, $\left[ \hat{y}^{(i)}_{d} \ldots \hat{y}^{(k)}_{d} \right]$ denotes the classification sequence produced by the classifier for frames $i$ to $k$, given an extended context window $\mathbf{X}^{(i,k)}$.
  • Figure 5: Model overview of classification models used for call detection. Only a single input sequence is shown, and batch processing is omitted from the illustration.
  • ...and 7 more figures
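
The two detection strategies in the Figure 4 caption differ mainly in how the input is cut up, which the following sketch illustrates. It is written under assumptions that are not stated in the captions: the helper names `context_window` and `extended_windows` are hypothetical, the per-frame window is taken to span `w // 2` frames on either side of the centre frame, edges are padded by repeating the boundary frame, and the extended windows do not overlap.

```python
from typing import List, Sequence

Frame = Sequence[float]  # one spectral feature vector, e.g. one mel-spectrogram column


def context_window(features: Sequence[Frame], i: int, w: int) -> List[Frame]:
    """Per-frame strategy: the frames around centre frame i, for ONE output score.

    Assumption: the window spans w // 2 frames on each side of i, and indices
    outside the recording are clamped to the first or last frame.
    """
    half = w // 2
    last = len(features) - 1
    return [features[min(max(j, 0), last)] for j in range(i - half, i + half + 1)]


def extended_windows(features: Sequence[Frame], length: int) -> List[List[Frame]]:
    """Sequence strategy: non-overlapping chunks, each yielding one score PER frame.

    Assumption: chunks do not overlap and the final chunk may be shorter.
    """
    return [list(features[s:s + length]) for s in range(0, len(features), length)]


features = [[0.0], [1.0], [2.0], [3.0], [4.0]]  # five toy one-dimensional frames
print(context_window(features, i=2, w=2))  # [[1.0], [2.0], [3.0]]
print(context_window(features, i=0, w=2))  # [[0.0], [0.0], [1.0]]
print(extended_windows(features, length=2))
# [[[0.0], [1.0]], [[2.0], [3.0]], [[4.0]]]
```

With the per-frame strategy a classifier is evaluated once for every frame, while the sequence strategy produces all frame decisions for a chunk in a single pass.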