How Far Do SSL Speech Models Listen for Tone? Temporal Focus of Tone Representation under Low-resource Transfer
Minu Kim, Ji Sub Um, Hoirin Kim
TL;DR
This work investigates how self-supervised learning (SSL) speech models encode lexical tone in four Southeast Asian languages with diverse tonal systems: Burmese, Thai, Lao, and Vietnamese. It first estimates a baseline temporal span of tone cues (about 100 ms for Burmese and Thai, about 180 ms for Lao and Vietnamese), then applies gradient-based and layer-wise probing to SSL encoders to assess where tone information resides. The study shows that mid-to-high layers best capture tone cues and that transfer is strongest when fine-tuning tasks align with the language-specific tone spans, with ASR on the target language providing the most faithful tone representations. The findings highlight tone as a transferable suprasegmental feature and offer guidance for robust, low-resource speech systems in tonal languages.
Abstract
Lexical tone is central to many languages but remains underexplored in self-supervised learning (SSL) speech models, especially beyond Mandarin. We study four languages with complex and diverse tone systems (Burmese, Thai, Lao, and Vietnamese) to examine how far such models listen for tone and how transfer operates in low-resource conditions. As a baseline reference, we estimate the temporal span of tone cues to be about 100 ms in Burmese and Thai, and about 180 ms in Lao and Vietnamese. Probes and gradient analyses on fine-tuned SSL models reveal that tone transfer varies by downstream task: automatic speech recognition fine-tuning aligns spans with language-specific tone cues, while prosody- and voice-related tasks bias the model toward overly long spans. These findings indicate that the downstream task shapes tone transfer and, in particular, the temporal focus of tone modeling.
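To make the idea of a "temporal span" concrete, the following is a minimal sketch of one way a span in milliseconds could be read off a per-frame gradient saliency profile. It is an illustration under stated assumptions, not the procedure used in this work: the function name `effective_span_ms`, the 20 ms frame stride, and the 90% saliency-mass threshold are all hypothetical choices made for this example.

```python
def effective_span_ms(saliency, center, frame_ms=20.0, mass=0.9):
    """Width in ms of the smallest window around `center` that holds
    at least `mass` of the total saliency.

    saliency : non-negative per-frame gradient magnitudes (hypothetical input)
    center   : index of the frame whose tone prediction is being explained
    frame_ms : assumed encoder frame stride in milliseconds
    mass     : assumed fraction of total saliency the window must cover
    """
    total = sum(saliency)
    if total <= 0:
        raise ValueError("saliency must have positive total mass")
    n = len(saliency)
    # Grow the window one frame per side until it covers enough saliency,
    # clipping at the sequence boundaries.
    for half in range(n):
        lo = max(0, center - half)
        hi = min(n - 1, center + half)
        if sum(saliency[lo:hi + 1]) >= mass * total:
            return (hi - lo + 1) * frame_ms
    return n * frame_ms


# Toy profile peaked at frame 4: three frames already hold 20 of 22 units.
peaked = [0, 0, 1, 5, 10, 5, 1, 0, 0]
print(effective_span_ms(peaked, center=4))  # 60.0
```

Under these assumptions, a sharply peaked profile yields a short span, while a flat profile over the same nine frames yields the full 180 ms, which is the kind of contrast the abstract describes between task-aligned and overly long spans.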
