WhAM: Towards A Translative Model of Sperm Whale Vocalization

Orr Paradise, Pranav Muralikrishnan, Liangyuan Chen, Hugo Flores García, Bryan Pardo, Roee Diamant, David F. Gruber, Shane Gero, Shafi Goldwasser

TL;DR

WhAM introduces a transformer-based framework that translates arbitrary audio prompts into sperm whale codas, while also generating novel codas and learning embeddings useful for classification. Built on VampNet, WhAM uses a two-stage training regime (domain adaptation and species-specific fine-tuning) to capture sperm whale acoustic characteristics with modest data. Quantitative and perceptual evaluations demonstrate WhAM’s ability to produce perceptually realistic codas and to yield embeddings that support multiple downstream tasks, though limitations in click dynamics and vowel representation remain. The work highlights the potential of cross-domain acoustic translation in bioacoustics and provides a foundation for future scalable, domain-aware generative models with careful expert validation.

Abstract

Sperm whales communicate in short sequences of clicks known as codas. We present WhAM (Whale Acoustics Model), the first transformer-based model capable of generating synthetic sperm whale codas from any audio prompt. WhAM is built by finetuning VampNet, a masked acoustic token model pretrained on musical audio, using 10k coda recordings collected over the past two decades. Through iterative masked token prediction, WhAM generates high-fidelity synthetic codas that preserve key acoustic features of the source recordings. We evaluate WhAM's synthetic codas using Fréchet Audio Distance and through perceptual studies with expert marine biologists. On downstream classification tasks including rhythm, social unit, and vowel classification, WhAM's learned representations achieve strong performance, despite being trained for generation rather than classification. Our code is available at https://github.com/Project-CETI/wham
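
The iterative masked token prediction mentioned in the abstract can be illustrated with a small toy sketch. This is not WhAM's or VampNet's actual code: the model is replaced by a random stand-in (`toy_model`), the token sequence is one-dimensional rather than a grid, and the cosine unmasking schedule and confidence-based commit rule are assumptions borrowed from common parallel-decoding setups. All names (`MASK`, `toy_model`, `iterative_decode`) are hypothetical.

```python
import math
import random

MASK = None  # hypothetical sentinel marking a masked token position


def toy_model(tokens, vocab_size, rng):
    """Stand-in for a masked acoustic token model (assumption, not the real model).

    Returns one (predicted_token, confidence) pair per position.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_decode(tokens, vocab_size, steps=8, seed=0):
    """Fill masked positions over several passes, most confident first."""
    rng = random.Random(seed)
    tokens = list(tokens)
    n_start = sum(t is MASK for t in tokens)
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        if not masked:
            break
        preds = toy_model(tokens, vocab_size, rng)
        # Assumed cosine schedule: how many positions stay masked after this pass.
        keep_masked = math.floor(n_start * math.cos(math.pi / 2 * step / steps))
        n_commit = len(masked) - keep_masked
        # Commit the most confident predictions; unmasked prompt tokens are never touched.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens
```

Calling `iterative_decode([3, MASK, MASK, 5, MASK], vocab_size=16)` returns a fully filled sequence in which the unmasked prompt tokens (3 and 5) are unchanged. In a real system the filled tokens would then be passed to a detokenizer to produce audio.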

Paper Structure

This paper contains 63 sections, 2 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Left: WhAM is trained by finetuning VampNet [GarciaSKP23], an audio-to-audio transformer pretrained on a large music dataset (a). Namely, we perform domain adaptation (b) on animal vocalizations followed by species-specific finetuning (c) on a novel sperm whale coda dataset. Right: WhAM synthesizes context-aware variations (d) of input codas and acoustically translates (e) natural and (f) artificial audio into coda-like audio. Illustration © Alex Boersma 2025.
  • Figure 2: Overview of VampNet's generation pipeline. Input audio is first converted into a grid of tokens by the Tokenizer. These tokens are then partially masked to create a prompt. The Masked Acoustic Token Model (MATM) uses parallel iterative decoding to generate new tokens, which are finally converted back into audio by the Detokenizer. The colored squares represent acoustic tokens, with grey squares indicating masked positions.
  • Figure 3: Normalized Fréchet Audio Distance between sperm whale codas and various audio sources, before and after translation through WhAM. Lower FAD indicates greater acoustic similarity to natural codas. The horizontal line at 0.21 represents the baseline FAD between disjoint sets of natural codas. Full names of animals along with the number of samples from each can be found in \ref{tab:marine_animals}.
  • Figure 4: Expert performance on audio-only 2AFC (Task 1), mixed classification (Task 2), and spectrogram-assisted 2AFC (Task 3). Error bars show standard deviation across experts. While all tasks elicited above-chance performance (dashed line), spectrogram analysis showed the greatest variability between experts ($\sigma=0.17$). Tasks 1 and 3 had 30 pairs each; Task 2 had 25 samples.
  • Figure 5: Accuracy in mixed classification (Task 2) for different input domains. Natural codas (left) were misclassified as synthetic 36% of the time. The remaining columns depict performance on synthetic codas generated by WhAM from walrus vocalizations, non-coda acoustic impulses, and codas (respectively). There were five synthetic codas from each domain, plus ten natural codas for a total of 25 items.
  • ...and 8 more figures
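
The Fréchet Audio Distance reported in Figure 3 is based on the Fréchet distance between two Gaussians fitted to sets of audio embeddings. A minimal sketch of that quantity follows, assuming the embeddings have already been extracted by some audio model (which model is not stated in this excerpt) and stored as arrays of shape (samples, dimensions). The function name `frechet_distance` and its arguments are hypothetical, and the normalization applied in Figure 3 is not reproduced here.

```python
import numpy as np


def frechet_distance(emb_a, emb_b):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    d = ||mu_a - mu_b||^2 + Tr(S_a) + Tr(S_b) - 2 * Tr((S_a S_b)^(1/2))
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # Tr((S_a S_b)^(1/2)) equals the sum of square roots of the eigenvalues
    # of S_a S_b, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
```

Two identical embedding sets give a distance of approximately zero, and shifting one set by a constant vector adds the squared length of that shift. Lower values therefore indicate that the two sets are distributed more similarly in embedding space.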