FinchGPT: a Transformer based language model for birdsong analysis

Kosei Kobayashi, Kosuke Matsuzaki, Masaya Taniguchi, Keisuke Sakaguchi, Kentaro Inui, Kentaro Abe

TL;DR

The paper investigates whether Bengalese finch songs exhibit long-range dependencies similar to those of human language and tests Transformer-based language models on a textualized birdsong corpus. FinchGPT, a Transformer model trained from scratch on species-specific syllable sequences, outperforms Markov, RNN, and LSTM baselines in next-syllable prediction and reveals long-range dependencies via its attention mechanisms. Reverse engineering through attention-span restriction and HVC ablation shows that the model relies on non-adjacent dependencies, which are perturbed by brain manipulations, linking artificial processing to neural mechanisms. The findings suggest that large language models can reveal structure in animal vocalizations and provide a framework for comparing computational and neural processing of sequential vocalizations across species.

Abstract

Long-range dependencies among tokens, which originate from hierarchical structures, are a defining hallmark of human language. However, whether similar dependencies exist within the sequential vocalizations of non-human animals remains a topic of investigation. Transformer architectures, known for their ability to model long-range dependencies among tokens, provide a powerful tool for investigating this phenomenon. In this study, we employed the Transformer architecture to analyze the songs of the Bengalese finch (Lonchura striata domestica), which are characterized by their highly variable and complex syllable sequences. To this end, we developed FinchGPT, a Transformer-based model trained on a textualized corpus of birdsongs, which outperformed models based on other architectures in this domain. Attention weight analysis revealed that FinchGPT effectively captures long-range dependencies within syllable sequences. Furthermore, reverse engineering approaches demonstrated the impact of computational and biological manipulations on its performance: restricting FinchGPT's attention span and disrupting birdsong syntax through the ablation of specific brain nuclei markedly influenced the model's outputs. Our study highlights the transformative potential of large language models (LLMs) in deciphering the complexities of animal vocalizations, offering a novel framework for exploring the structural properties of non-human communication systems while shedding light on the computational distinctions between biological brains and artificial neural networks.
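
As an illustration of what restricting the attention span can mean in practice, the following is a minimal sketch of a causal attention mask limited to a fixed number of immediately preceding tokens. It is not the authors' implementation: the function names `restricted_causal_mask` and `masked_softmax` are hypothetical, and the sketch assumes that each position may attend to itself in addition to the `span` preceding tokens.

```python
import numpy as np

def restricted_causal_mask(seq_len: int, span: int) -> np.ndarray:
    """Boolean mask; entry [i, j] is True when query i may attend to key j.

    Assumption (not from the paper): position i sees itself plus at most
    `span` immediately preceding tokens, and never any future token.
    """
    idx = np.arange(seq_len)
    offset = idx[:, None] - idx[None, :]  # how far key j lies behind query i
    return (offset >= 0) & (offset <= span)

def masked_softmax(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Row-wise softmax that assigns zero weight to masked-out positions."""
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

# Usage: with uniform scores, weight is spread evenly over the allowed window.
mask = restricted_causal_mask(seq_len=5, span=2)
weights = masked_softmax(np.zeros((5, 5)), mask)
```

Comparing the cross-entropy of a model evaluated under such masks for increasing `span` values gives a rough picture of how much the predictions depend on distant context.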

Paper Structure

This paper contains 22 sections, 3 equations, 4 figures.

Figures (4)

  • Figure 1: Generation and evaluation of GPT trained with birdsong. (A) A picture of a Bengalese finch and a spectrogram of an example Bengalese finch song, with each syllable labelled with an arbitrary alphabet symbol. (B) Schematic diagrams showing the core computational concept of each algorithm in token processing. t1–t5 represent the sequence of input tokens. Red highlights the token currently being processed and the information used for computation. The Markov model predicts the next token only from the transition probabilities. RNN and LSTM propagate information primarily through hidden states from nearby tokens, while the Transformer captures global context via self-attention. (C) Percentage of correct answers in the next-token prediction task (mean ± SEM, n = 3 corpora, each from a different bird). (D) Cross-entropy between model outputs and correct tokens. Box plots show the median and the first and third quartiles. Actual data points representing each corpus are shown in gray. (E) Performance changes with artificially structured training and test corpora (mean ± SEM, n = 3 corpora, each from a different bird). P values, paired t-test, n = 3. (F) Diagram of the FinchGPT architecture. L, A, and H represent the number of layers, attention heads, and hidden state dimensions, respectively. The embedding dimension is identical to the hidden state dimension. (G) Comparison of cross-entropy in the next-token prediction task performed on FinchGPT with varying model parameters: layers (left, A = L, H = 192) and hidden state dimensions (right, L = 6 and A = 6). (H) Same as (G) with varying training text data sizes. FinchGPT-small (1L, 1A, 192H), medium (6L, 6A, 384H), and large (12L, 12A, 768H) are shown. Mean ± SEM, n = 235,256 predictions across 2,660 songs for both comparisons in (G–H).
  • Figure 2: Attention visualization reveals the long-range dependencies within a song. (A) An example of self-attention matrices within FinchGPT-medium (6L, 6A, 384H) when processing a song with 46 tokens (44 syllables + 2 special tokens) as input. The horizontal and vertical axes represent the order of input syllables, while the attention weights between syllables are color coded to indicate their relative intensities. (B) Example of observed attention in layers 1 and 6. Arrows represent the attention weights forming the maximum spanning tree, as identified by the Chu-Liu-Edmonds algorithm. (C) Average span length of token pairs in each layer. Only attention weights greater than 0.5 were analyzed (mean ± SEM, n = 2,935, 1,403, 4,124, 17,054, 22,670, and 11,080 from layer 1 to 6). (D) The cross-entropy values of the model restricted to specific attention span lengths (mean ± SEM, n = 235,256 predictions across 2,660 songs). The numbers on the x-axis indicate the maximum number of immediately preceding tokens that can be attended to in each attention operation. Unrestricted FinchGPT uses an attention span length of up to 256 tokens.
  • Figure 3: Token embedding analysis reveals the contextual usage of syllables. (A) Three-dimensional plots showing the trajectory of changes in token embeddings for three example songs across Transformer layers. The original 384-dimensional embeddings were reduced to 3 dimensions by PCA. Different colors represent distinct syllables. The three sample songs were randomly selected from the corpus of a single individual. (B) Same as (A), but only the tokens for syllables “e” and “h” are shown. (C) Distribution of cosine similarity relative to the mean embeddings of each layer in 384-dimensional space. The vertical axis is presented on a logarithmic scale.
  • Figure 4: Ablation of a brain nucleus reduces the performance of FinchGPT. (A) Schematic showing the song-related nuclei of the songbird brain. Circuit diagrams show the analogous functions of songbird nuclei and the corresponding human brain regions. Arrows indicate axonal projections. LMAN, lateral magnocellular nucleus of the anterior nidopallium; DLM, dorsal lateral nucleus of the medial thalamus; nXIIts, XII cranial nerve; LMC, laryngeal motor cortex; LSC, laryngeal somatosensory cortex. (B) Example spectrograms of songs from identical birds before and after HVC ablation. (C) Example spectrograms of syllables before and after HVC ablation (left) and the comparison of pitch values (right). Each point represents one syllable. Scale bar, 50 ms. P values, Wilcoxon signed-rank test, n = 13. (D) Cross-entropy of FinchGPT-medium (6L, 6A, 384H) in the next-token prediction task, comparing performance on the corpora recorded before and after HVC ablation. The model was trained on the before-ablation corpus, and the results were evaluated using the holdout method. Box plots show the median and the first and third quartiles, with the mean shown as circles. P values, Wilcoxon rank-sum test. n = 38,599 predictions across 752 songs recorded before ablation, and n = 59,111 predictions across 1,180 songs recorded after ablation. (E) The cross-entropy values of the model restricted to specific attention span lengths in the next-token prediction task for songs from before and after HVC ablation (mean ± SEM, n = 38,599 and 59,111 predictions across 752 and 1,180 songs in the before- and after-ablation corpora, respectively). (F) Comparison of the performance of FinchGPT with restricted attention span lengths in classifying songs as recorded before or after HVC ablation. Accuracy (left), and ROC (receiver operating characteristic) curve and AUC (area under the ROC curve) (right). Dashed lines show the theoretical upper bound (left) and the chance level (right). The after-ablation corpus was treated as the positive class in the ROC analysis.
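
The two next-token prediction metrics reported in Figure 1 (C, D), percentage of correct answers and cross-entropy, can be made concrete with a small first-order Markov baseline. The sketch below is illustrative only and is not the authors' code: the function names `train_bigram` and `evaluate`, the add-alpha smoothing, and the toy songs are all assumptions, and it presumes that every syllable in the test songs also occurs in the training songs.

```python
import math
from collections import Counter, defaultdict

def train_bigram(songs, alpha=1.0):
    """Estimate first-order transition probabilities with add-alpha smoothing.

    Each song is a string whose characters are syllable labels.
    """
    vocab = sorted({s for song in songs for s in song})
    counts = defaultdict(Counter)
    for song in songs:
        for prev, nxt in zip(song, song[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for prev in vocab:
        total = sum(counts[prev].values()) + alpha * len(vocab)
        probs[prev] = {s: (counts[prev][s] + alpha) / total for s in vocab}
    return probs

def evaluate(probs, songs):
    """Return (top-1 accuracy, mean cross-entropy in nats) over all transitions."""
    correct, nll, n = 0, 0.0, 0
    for song in songs:
        for prev, nxt in zip(song, song[1:]):
            dist = probs[prev]
            correct += max(dist, key=dist.get) == nxt  # most probable next syllable
            nll -= math.log(dist[nxt])                 # negative log-likelihood
            n += 1
    return correct / n, nll / n

# Usage with toy songs (hypothetical data, not from the paper's corpus).
probs = train_bigram(["abcabcabc", "abcabc"])
accuracy, cross_entropy = evaluate(probs, ["abcabc"])
```

Because such a model conditions only on the immediately preceding syllable, it cannot exploit the non-adjacent dependencies that the attention analyses in Figures 2 and 4 are designed to probe.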