WhAM: Towards A Translative Model of Sperm Whale Vocalization
Orr Paradise, Pranav Muralikrishnan, Liangyuan Chen, Hugo Flores García, Bryan Pardo, Roee Diamant, David F. Gruber, Shane Gero, Shafi Goldwasser
TL;DR
WhAM is a transformer-based framework that translates arbitrary audio prompts into sperm whale codas, generates novel codas, and learns embeddings useful for classification. Built on VampNet, WhAM uses a two-stage training regime (domain adaptation followed by species-specific fine-tuning) to capture sperm whale acoustic characteristics from a modest amount of data. Quantitative and perceptual evaluations indicate that WhAM produces realistic codas and yields embeddings that support multiple downstream tasks, though limitations in click dynamics and vowel representation remain. The work highlights the potential of cross-domain acoustic translation in bioacoustics and provides a foundation for future scalable, domain-aware generative models validated by domain experts.
Abstract
Sperm whales communicate in short sequences of clicks known as codas. We present WhAM (Whale Acoustics Model), the first transformer-based model capable of generating synthetic sperm whale codas from any audio prompt. WhAM is built by fine-tuning VampNet, a masked acoustic token model pretrained on musical audio, using 10k coda recordings collected over the past two decades. Through iterative masked token prediction, WhAM generates high-fidelity synthetic codas that preserve key acoustic features of the source recordings. We evaluate WhAM's synthetic codas using Fréchet Audio Distance and through perceptual studies with expert marine biologists. On downstream tasks including rhythm, social unit, and vowel classification, WhAM's learned representations achieve strong performance, despite being trained for generation rather than classification. Our code is available at https://github.com/Project-CETI/wham
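
To make "iterative masked token prediction" concrete, here is a minimal Python sketch of confidence-based parallel decoding, the general scheme used by masked token models. It is an illustration under stated assumptions, not WhAM's or VampNet's actual code: `toy_predictor` is a hypothetical stand-in that returns random tokens and confidences (a real model would condition on the unmasked context), and the `MASK` sentinel, the vocabulary size, the number of steps, and the cosine schedule are all illustrative choices that the abstract does not specify.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a masked token position


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for the model (assumption): for every masked position,
    return a (token, confidence) guess. A real model would use the
    unmasked tokens as context; this toy version ignores them."""
    return {
        i: (rng.randrange(vocab_size), rng.random())
        for i, t in enumerate(tokens)
        if t == MASK
    }


def iterative_decode(prompt, steps=8, vocab_size=1024, seed=0):
    """Fill masked positions over several passes, committing the most
    confident predictions first and leaving the rest masked for later."""
    rng = random.Random(seed)
    tokens = list(prompt)
    total_masked = sum(t == MASK for t in tokens)

    for step in range(1, steps + 1):
        preds = toy_predictor(tokens, vocab_size, rng)
        if not preds:
            break  # nothing left to fill

        # Cosine schedule (assumed): how many positions should still be
        # masked after this step. It reaches zero on the final step.
        keep_masked = int(total_masked * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(preds) - keep_masked)

        # Commit the highest-confidence predictions; the others stay masked.
        ranked = sorted(preds.items(), key=lambda kv: kv[1][1], reverse=True)
        for i, (tok, _conf) in ranked[:n_commit]:
            tokens[i] = tok

    return tokens


# Usage: positions holding real tokens act as the prompt and are never changed.
prompt = [5, MASK, MASK, 7, MASK, MASK, MASK, 9, MASK, MASK]
decoded = iterative_decode(prompt)
```

In this sketch the unmasked entries of `prompt` play the role of the audio prompt: they are preserved in the output while the masked positions are filled in over successive passes.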
