Cross-Lingual Transfer Learning for Speech Translation

Rao Ma, Mengjie Qian, Yassir Fathullah, Siyuan Tang, Mark Gales, Kate Knill

TL;DR

This work probes how multilingual speech foundation models such as Whisper can extend speech translation to new languages with restricted data. It demonstrates that encoder outputs across languages occupy a shared semantic space, which enables zero-shot cross-lingual transfer when the decoder is guided by the appropriate language and task tokens. Zero-shot decoding with those tokens and targeted fine-tuning (notably en→zh) can extend translation to new target languages and to unseen source languages, albeit with a risk of catastrophic forgetting if the encoder is altered too much. The findings show that cross-lingual alignment can be leveraged to broaden language support without extensive data collection, which matters for scalable inclusion of low-resource languages.

Abstract

There has been increasing interest in building multilingual foundation models for NLP and speech research. This paper examines how to expand the speech translation capability of these models with restricted data. Whisper, a speech foundation model with strong performance on speech recognition and English translation, is used as the example model. Using speech-to-speech retrieval to analyse the audio representations generated by the encoder, we show that utterances from different languages are mapped to a shared semantic space. This shared embedding space can then be leveraged for zero-shot cross-lingual transfer in speech translation. By fine-tuning the Whisper decoder with only English-to-Chinese speech translation data, improved performance for translation to Chinese can be obtained for multiple languages, in addition to English. Furthermore, for languages related to those seen in training it is possible to perform speech translation, despite the model never seeing the language in training, or being able to perform transcription.

Paper Structure (26 sections, 2 equations, 4 figures, 12 tables)

Figures (4)

  • Figure 1: Illustration of Whisper's decoding process for ASR and speech translation tasks. Whisper supports speech recognition in 100 languages and speech translation from any language into English (orange (German, $de$, input) and purple (French, $fr$, input) text blocks). Fine-tuning on English-to-Chinese, $en\!\rightarrow\!zh$, speech translation data enables the model to acquire additional speech translation capabilities (such as $de\!\rightarrow\!zh$ and $fr\!\rightarrow\!zh$) through cross-lingual transfer (gray text blocks). The Whisper <transcribe> task token is used in this case, as the <translate> task token causes English words to be output, independent of the target language.
  • Figure 2: t-SNE visualization of contextual speech embeddings generated by Whisper large-v2 encoder for 6 word tuples across 5 languages.
  • Figure 3: Cosine similarity matrix of utterance representations between an English sentence and its German counterpart selected from FLEURS test sets.
  • Figure 4: Speech-to-speech retrieval using outputs from different encoder layers of Whisper large-v2.
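
The cosine-similarity and speech-to-speech retrieval analyses named in Figures 3 and 4 can be illustrated with a small sketch. It is not the paper's implementation: the mean pooling over frames, the synthetic "encoder outputs", and all function names (`mean_pool`, `cosine_similarity_matrix`, `retrieve`, `recall_at_1`, `demo`) are assumptions made for illustration. The idea is that if parallel utterances in two languages land close together in a shared space, then nearest-neighbour search from one language should recover the matching utterance in the other.

```python
import numpy as np


def mean_pool(frames: np.ndarray) -> np.ndarray:
    """Collapse a (num_frames, dim) matrix of frame vectors into one vector.

    Mean pooling is an assumption of this sketch, not necessarily the
    pooling or alignment used in the paper.
    """
    return frames.mean(axis=0)


def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


def retrieve(queries: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """For each query vector, return the index of the most similar candidate."""
    return cosine_similarity_matrix(queries, candidates).argmax(axis=1)


def recall_at_1(queries: np.ndarray, candidates: np.ndarray) -> float:
    """Fraction of queries whose top match is the parallel utterance.

    Assumes row i of `queries` and row i of `candidates` are translations
    of each other.
    """
    predictions = retrieve(queries, candidates)
    return float((predictions == np.arange(len(queries))).mean())


def demo(num_utterances: int = 20, dim: int = 64, seed: int = 0) -> float:
    """Run retrieval on synthetic stand-ins for encoder outputs."""
    rng = np.random.default_rng(seed)
    # One hypothetical "meaning" vector per sentence, shared by both languages.
    meanings = rng.normal(size=(num_utterances, dim))

    def fake_encoder(meaning: np.ndarray, num_frames: int) -> np.ndarray:
        # Frames are the shared meaning plus small language-specific noise.
        return meaning + 0.05 * rng.normal(size=(num_frames, dim))

    lang_a = np.stack([mean_pool(fake_encoder(m, 30)) for m in meanings])
    lang_b = np.stack([mean_pool(fake_encoder(m, 40)) for m in meanings])
    return recall_at_1(lang_a, lang_b)


if __name__ == "__main__":
    print(f"recall@1 on synthetic data: {demo():.2f}")
```

With real data, the synthetic frames would be replaced by encoder outputs for parallel utterances (for example, taken from a chosen encoder layer), and the recall figure would then indicate how well that layer aligns the two languages.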