Encoding of lexical tone in self-supervised models of spoken language

Gaofei Shen, Michaela Watkins, Afra Alishahi, Arianna Bisazza, Grzegorz Chrupała

TL;DR

This work investigates how lexical tone, a suprasegmental cue, is encoded in self-supervised spoken-language representations. By probing wav2vec2-based models trained on both tonal and non-tonal languages, the study shows that tone information is recoverable even without explicit tonal supervision, with higher-layer encodings strengthened in tonal-language pretraining. ASR fine-tuning enhances tone representations for tonal-language models but can diminish them for non-tonal-language models, indicating task-driven specialization. The findings reveal human-like patterns in tone and consonant perception while highlighting differences in developmental trajectories, offering insights into suprasegmental representation and guiding future cross-language speech system design.

Abstract

Interpretability research has shown that self-supervised Spoken Language Models (SLMs) encode a wide variety of features in human speech from the acoustic, phonetic, phonological, syntactic and semantic levels, to speaker characteristics. The bulk of prior research on representations of phonology has focused on segmental features such as phonemes; the encoding of suprasegmental phonology (such as tone and stress patterns) in SLMs is not yet well understood. Tone is a suprasegmental feature that is present in more than half of the world's languages. This paper aims to analyze the tone encoding capabilities of SLMs, using Mandarin and Vietnamese as case studies. We show that SLMs encode lexical tone to a significant degree even when they are trained on data from non-tonal languages. We further find that SLMs behave similarly to native and non-native human participants in tone and consonant perception studies, but they do not follow the same developmental trajectory.
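To make the idea of layer-wise probing concrete, the following is a minimal sketch, not the authors' setup. It assumes synthetic feature vectors in place of real wav2vec2 hidden states, and it swaps in a nearest-centroid classifier for whatever probe the paper actually uses. The function names, the layer indices, and the `layer_signal` values are all hypothetical and chosen only for illustration; they are not results from the paper.

```python
import random


def make_layer_features(labels, signal, dim=8, seed=0):
    """Synthetic stand-in for one layer's pooled representations.

    Each tone class gets a random prototype scaled by `signal`
    (an assumed knob for how strongly the layer encodes tone);
    unit-variance Gaussian noise is added per example.
    """
    rng = random.Random(seed)
    n_classes = max(labels) + 1
    prototypes = [[rng.gauss(0, 1) * signal for _ in range(dim)]
                  for _ in range(n_classes)]
    return [[p + rng.gauss(0, 1) for p in prototypes[y]] for y in labels]


def fit_centroids(features, labels):
    """Compute the mean feature vector of each class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))


def probe_accuracy(features, labels, train_fraction=0.8):
    """Fit the probe on the first part of the data, score it on the rest."""
    split = int(len(labels) * train_fraction)
    centroids = fit_centroids(features[:split], labels[:split])
    test = list(zip(features[split:], labels[split:]))
    return sum(predict(centroids, x) == y for x, y in test) / len(test)


# Four tone classes (T1..T4 encoded as 0..3), balanced across the dataset.
labels = [i % 4 for i in range(400)]

# Hypothetical layers with increasing amounts of tone signal.
layer_signal = {0: 0.0, 6: 1.0, 12: 3.0}

accuracy_by_layer = {
    layer: probe_accuracy(make_layer_features(labels, s, seed=layer), labels)
    for layer, s in layer_signal.items()
}

for layer, acc in accuracy_by_layer.items():
    print(f"layer {layer:2d}: probe accuracy = {acc:.2f}")
```

In a real experiment, `make_layer_features` would be replaced by hidden states extracted from each layer of a pretrained model, and the probe's accuracy per layer would be compared across models pretrained on tonal and non-tonal languages.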

Paper Structure (31 sections, 6 figures, 4 tables)

Figures (6)

  • Figure 1: F0 contours of the four Mandarin tones measured from pronunciations recorded by one of the co-authors, a native speaker of Mandarin Chinese. The four syllables are pronounced in isolation (notation: mā T1, má T2, mǎ T3, mà T4).
  • Figure 2: Classification accuracy of Mandarin lexical tones using layer-wise representations from models pre-trained on tonal and non-tonal languages.
  • Figure 4: Classification accuracy of Mandarin lexical tones using layer-wise representations from models pre-trained and fine-tuned on Mandarin and English.
  • Figure 6: Classification accuracy of Mandarin lexical tones versus consonants for models pre-trained on English and Mandarin.
  • Figure 7: Binary classification accuracy for Mandarin tonal pairs, for English and Mandarin models.
  • ...and 1 more figure