Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically

Ryan Soh-Eun Shim, Domenico De Cristofaro, Chengzhi Martin Hu, Alessandro Vietti, Barbara Plank

TL;DR

This work investigates whether cross-lingual alignment in speech foundation models is truly semantic or largely driven by phonetic cues. It introduces a pronunciation-controlled challenge set and SeqSimInterp to interpret retrieval decisions, and uses early exiting to probe how representations evolve across encoder layers. The findings show that semantic information underpins cross-lingual spoken translation retrieval, though phonetic cues still influence early-layer behavior; multilingual supervision (e.g., speech translation training) enhances semantic alignment. The results demonstrate practical benefits for low-resource languages and offer methods to analyze and leverage semantic structure in speech models, with implications for model design and zero-shot transfer.

Abstract

Cross-lingual alignment in pretrained language models (LMs) has enabled efficient transfer in text-based LMs. Such an alignment has also been observed in speech foundation models. However, it remains an open question whether findings and methods from text-based cross-lingual alignment apply to speech. Building on prior work on spoken translation retrieval, we perform pronunciation-controlled experiments to observe if cross-lingual alignment can indeed occur in such models on a semantic basis, instead of relying on phonetic similarities. Our findings indicate that even in the absence of phonetic cues, spoken translation retrieval accuracy remains relatively stable. We follow up with a controlled experiment on a word-level dataset of cross-lingual synonyms and near-homophones, confirming the existence of both phonetic and semantic knowledge in the encoder. Finally, we qualitatively examine the transcriptions produced by early exiting the encoder, where we observe that speech translation produces semantic errors that are characterized by phonetic similarities to corresponding words in the source language. We apply this insight from early exiting to speech recognition in seven low-resource languages unsupported by the Whisper model, and achieve improved accuracy in all languages examined, particularly for languages with transparent orthographies.

Paper Structure

This paper contains 25 sections, 1 equation, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Prior work [s3m_word] finds that the center frame of an audio embedding retains word identity information. We therefore take the center-frame representation of each word in an utterance, locating it with word-level timestamps. The timestamps are obtained by applying dynamic time warping to the cross-attention weights [zusag24_interspeech].
  • Figure 2: Illustration of our proposed method for determining whether cross-lingual speech retrieval relies on semantic features. Starting from the center-frame embeddings obtained in Figure 1, we match each center frame to the most similar frame in the target utterance by cosine similarity, and repeat this in the reverse direction. We then map the matched frames back to words using the inferred timestamps, and use a multilingual text encoder to quantify how far words that mutually select each other as most similar are semantically equivalent.
  • Figure 3: Word-level analyses on whisper-medium. x-axis is layer count, y-axis is cosine similarity.
  • Figure 4: Word-level analyses on whisper-tiny. x-axis is layer count, y-axis is cosine similarity.
  • Figure 5: Word-level analyses on whisper-base. x-axis is layer count, y-axis is cosine similarity.
  • ...and 4 more figures
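
The matching procedure described in the captions of Figures 1 and 2 can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the toy embeddings, and the `fps` frame-rate parameter are all assumptions made for the example, and the final step (scoring matched word pairs with a multilingual text encoder) is left out. Word spans are assumed to be given as `(start_seconds, end_seconds)` pairs, for instance from timestamps inferred elsewhere.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def center_frame_index(start_s, end_s, fps, num_frames=None):
    """Index of the frame at the temporal midpoint of a word span (hypothetical helper)."""
    idx = int(((start_s + end_s) / 2.0) * fps)
    if num_frames is not None:
        idx = min(max(idx, 0), num_frames - 1)  # clamp to the valid frame range
    return idx

def word_of_frame(frame_idx, spans, fps):
    """Map a frame back to the word whose time span contains the frame's midpoint."""
    t = (frame_idx + 0.5) / fps
    for w, (start_s, end_s) in enumerate(spans):
        if start_s <= t < end_s:
            return w
    return None  # frame falls outside every word span (e.g. silence)

def _one_direction(a_frames, a_spans, b_frames, b_spans, fps):
    """For each word in A, find the word in B that owns the most similar frame."""
    matches = {}
    for w, (start_s, end_s) in enumerate(a_spans):
        center = a_frames[center_frame_index(start_s, end_s, fps, len(a_frames))]
        best = max(range(len(b_frames)), key=lambda k: cosine(center, b_frames[k]))
        matches[w] = word_of_frame(best, b_spans, fps)
    return matches

def mutual_word_matches(src_frames, src_spans, tgt_frames, tgt_spans, fps):
    """Return (source word, target word) index pairs that select each other."""
    s2t = _one_direction(src_frames, src_spans, tgt_frames, tgt_spans, fps)
    t2s = _one_direction(tgt_frames, tgt_spans, src_frames, src_spans, fps)
    return [(i, j) for i, j in s2t.items() if j is not None and t2s.get(j) == i]

# Toy example: two words per utterance, word order swapped in the target.
src_frames = [[1, 0], [1, 0], [0, 1], [0, 1]]
tgt_frames = [[0, 1], [0, 1], [1, 0], [1, 0]]
spans = [(0.0, 1.0), (1.0, 2.0)]
pairs = mutual_word_matches(src_frames, spans, tgt_frames, spans, fps=2.0)
print(pairs)
```

In this toy case the first source word pairs with the second target word and vice versa. In a real setting the frames would be encoder hidden states at a chosen layer, and the resulting word pairs would then be compared with a text encoder to judge whether the match is semantic rather than purely phonetic.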