How Far Do SSL Speech Models Listen for Tone? Temporal Focus of Tone Representation under Low-resource Transfer

Minu Kim, Ji Sub Um, Hoirin Kim

TL;DR

This work investigates how self-supervised speech models encode lexical tone in four Southeast Asian languages with diverse tonal systems. It combines a baseline estimate of the acoustic span of tone cues (about 100 ms for Burmese and Thai, about 180 ms for Lao and Vietnamese) with gradient-based and layer-wise probing of SSL encoders to assess where tone information resides. Mid-to-high layers capture tone cues best, and the fine-tuning task shapes transfer: ASR on the target language yields a temporal focus closest to the language-specific tone span, whereas prosody- and voice-related tasks push the model toward overly long spans. The findings highlight tone as a transferable suprasegmental feature and offer guidance for robust, low-resource speech systems in tonal languages.

Abstract

Lexical tone is central to many languages but remains underexplored in self-supervised learning (SSL) speech models, especially beyond Mandarin. We study four languages with complex and diverse tone systems: Burmese, Thai, Lao, and Vietnamese, to examine how far such models listen for tone and how transfer operates in low-resource conditions. As a baseline reference, we estimate the temporal span of tone cues to be about 100 ms in Burmese and Thai, and about 180 ms in Lao and Vietnamese. Probes and gradient analyses on fine-tuned SSL models reveal that tone transfer varies by downstream task: automatic speech recognition fine-tuning aligns spans with language-specific tone cues, while prosody- and voice-related tasks bias the model toward overly long spans. These findings indicate that tone transfer is shaped by downstream task, highlighting task effects on temporal focus in tone modeling.
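
The baseline span estimate can be illustrated with a small sketch. The abstract does not state how the span is read off the window-length sweep, so the rule below (the shortest window whose macro-F1 is within a tolerance of the best score) is an assumption, as are the function names `macro_f1` and `estimate_span` and the example scores, which are invented for illustration and are not results from the paper.

```python
from collections import defaultdict


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    f1_scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1_scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def estimate_span(scores_by_window_ms, tolerance=0.01):
    """Assumed rule: shortest window whose score is within `tolerance` of the best."""
    best = max(scores_by_window_ms.values())
    return min(w for w, s in scores_by_window_ms.items() if s >= best - tolerance)


# Hypothetical sweep (window length in ms -> macro-F1); NOT data from the paper.
example_scores = {20: 0.40, 60: 0.55, 100: 0.70, 180: 0.705, 300: 0.70}
span_ms = estimate_span(example_scores, tolerance=0.01)
```

With this rule, a curve that plateaus early yields a short span and a curve that keeps improving yields a long one, which is the qualitative contrast the abstract draws between Burmese/Thai and Lao/Vietnamese.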

Paper Structure

This paper contains 13 sections, 2 equations, 7 figures.

Figures (7)

  • Figure 1: Tone distributions across languages. (Vietnamese tones: ngang (high level), huyền (low level), sắc (rising), hỏi (falling-rising), ngã (broken), nặng (falling) [pham2003key].)
  • Figure 2: Baseline tone classification macro-F1 with logistic regression across varying window lengths (20–300 ms). Shorter spans suffice for Burmese and Thai, whereas Lao and Vietnamese require longer spans.
  • Figure 3: Normalized gradient energy around tone centers with XLS-R ASR models fine-tuned in each language. Burmese/Thai show sharper focus, while Lao/Vietnamese show broader spreads, consistent with acoustic span baselines.
  • Figure 4: Layer-wise probe performance (macro-F1) across Burmese, Thai, Lao, and Vietnamese using different SSL models (XLS-R-target: ASR on target language; XLS-R-ZH: ASR on Mandarin Chinese; XLS-R-EN: ASR on English; XLS-R-ASV: speaker verification; XLS-R-emotion: emotion recognition; XLS-R-gender: gender classification). Probe performance usually peaks in mid-to-high layers, showing that lexical tone information is mainly captured at those layers.
  • Figure 5: Distributions of the layer-wise center-of-mass radius $r_{\mathrm{com}}^{(\ell)}$ for tone gradients, shown separately for lower (0–11) and higher (12–24) layers across models (W2V2: wav2vec 2.0 large; Vanilla: XLS-R-vanilla; ZH: XLS-R-ZH; EN: XLS-R-EN; ASV: XLS-R-ASV; Emotion: XLS-R-emotion; Gender: XLS-R-gender). The red line indicates the baseline span; target-language ASR shows the closest fit.
  • ...and 2 more figures
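
The gradient-based quantities named in Figures 3 and 5 can be sketched as follows. The captions do not define them, so this is one plausible reading and an assumption throughout: gradient energy is taken as the squared norm of the gradient at each frame, it is normalized to sum to one, and the center-of-mass radius is the energy-weighted mean distance from the tone center. The function names are hypothetical. The 20 ms frame stride is that of the wav2vec 2.0 / XLS-R encoder.

```python
FRAME_STRIDE_MS = 20.0  # wav2vec 2.0 / XLS-R encoders emit one frame every 20 ms


def gradient_energy(grad_frames):
    """Per-frame energy: squared L2 norm of each frame's gradient vector."""
    return [sum(g * g for g in frame) for frame in grad_frames]


def normalize(energy):
    """Scale energies so they sum to 1 (assumed normalization)."""
    total = sum(energy)
    if total == 0:
        raise ValueError("gradient energy is zero everywhere")
    return [e / total for e in energy]


def com_radius_ms(energy, center_idx, stride_ms=FRAME_STRIDE_MS):
    """Assumed definition: energy-weighted mean |t - center| converted to ms."""
    weights = normalize(energy)
    radius_frames = sum(w * abs(t - center_idx) for t, w in enumerate(weights))
    return radius_frames * stride_ms


# Toy gradients for five frames (two dimensions each); illustrative values only.
toy_grads = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]]
toy_energy = gradient_energy(toy_grads)
toy_radius = com_radius_ms(toy_energy, center_idx=2)
```

Under this reading, a model whose gradients concentrate near the tone center gets a small radius, and a model that spreads its sensitivity over distant frames gets a large one, so the radius can be compared directly with a baseline span in milliseconds.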