ISPA: Inter-Species Phonetic Alphabet for Transcribing Animal Sounds
Masato Hagiwara, Marius Miron, Jen-Yu Liu
TL;DR
ISPA addresses the lack of a standard representation for animal sounds by introducing a text-based transcription that can be processed with NLP techniques. It defines two variants, ISPA-A and ISPA-F: ISPA-A encodes acoustic properties into tokens, while ISPA-F discretizes per-frame features via clustering and Viterbi segmentation, with an IPA-aligned mapping for interpretability. In BEANS-based experiments, ISPA-F with AVES features and RoBERTa fine-tuning achieves accuracy competitive with dense-audio baselines, IPA baselines underperform, and MLM-based self-supervision yields further gains. The work demonstrates that language-model pipelines can be applied to bioacoustics, enabling downstream tasks such as detection, captioning, and multimodal generation, and suggests that the approach could scale with larger datasets and models.
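To make the ISPA-F idea of discretizing per-frame features concrete, the following is a minimal sketch, not the authors' implementation. It assumes cluster centroids have already been obtained (for example by k-means over frame features), and it uses a fixed label-switching penalty as the transition cost in the Viterbi pass. The function names `viterbi_segment` and `collapse_runs`, the `switch_penalty` parameter, and the squared-distance emission cost are all illustrative assumptions.

```python
import numpy as np

def viterbi_segment(features, centroids, switch_penalty=1.0):
    """Label each frame with a centroid index, penalizing label changes.

    Assumption: emission cost is squared Euclidean distance and the
    transition cost is a constant penalty for switching labels.
    """
    diff = features[:, None, :] - centroids[None, :, :]
    cost = (diff ** 2).sum(axis=2)              # shape (T, K)
    T, K = cost.shape
    trans = switch_penalty * (1.0 - np.eye(K))  # 0 to stay, penalty to switch
    best = cost[0].copy()                       # best path cost ending in each label
    back = np.zeros((T, K), dtype=int)          # backpointers
    for t in range(1, T):
        total = best[:, None] + trans           # total[i, j]: previous i, current j
        back[t] = total.argmin(axis=0)
        best = total.min(axis=0) + cost[t]
    labels = np.empty(T, dtype=int)
    labels[-1] = best.argmin()
    for t in range(T - 1, 0, -1):               # trace the best path backwards
        labels[t - 1] = back[t, labels[t]]
    return labels

def collapse_runs(labels):
    """Merge consecutive identical labels into (label, run_length) segments."""
    segments = []
    for lab in labels:
        lab = int(lab)
        if segments and segments[-1][0] == lab:
            segments[-1] = (lab, segments[-1][1] + 1)
        else:
            segments.append((lab, 1))
    return segments

# Toy input: three frames near centroid 0, then three near centroid 1.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
frames = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [10.0, 10.0], [10.0, 9.9], [9.9, 10.0]])
segments = collapse_runs(viterbi_segment(frames, centroids))
```

Each resulting segment could then be mapped to a text token so that a sequence model such as RoBERTa can consume it. The switching penalty trades temporal smoothness against fidelity to individual frames: with a penalty of zero the procedure reduces to nearest-centroid assignment per frame.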
Abstract
Traditionally, bioacoustics has relied on spectrograms and continuous, per-frame audio representations both to analyze animal sounds and as input to machine learning models. Meanwhile, the International Phonetic Alphabet (IPA) system has provided an interpretable, language-independent method for transcribing human speech sounds. In this paper, we introduce ISPA (Inter-Species Phonetic Alphabet), a precise, concise, and interpretable system designed for transcribing animal sounds into text. We compare acoustics-based and feature-based methods for transcribing and classifying animal sounds, and show that they perform comparably to baseline methods that use continuous, dense audio representations. By representing animal sounds with text, we effectively treat them as a "foreign language," and we show that established human language ML paradigms and models, such as language models, can be successfully applied to improve performance.
