Improved Cross-Lingual Transfer Learning For Automatic Speech Translation

Sameer Khurana, Nauman Dawalatabad, Antoine Laurent, Luis Vicente, Pablo Gimeno, Victoria Mingote, James Glass

TL;DR

This work tackles cross-lingual transfer in multilingual speech translation by injecting semantic knowledge into a multilingual speech encoder. It introduces SAMU-XLS-R, a semantic knowledge-distillation framework that aligns speech representations with LaBSE-derived semantics, and expands to 53 languages. By initializing the translation model encoder with SAMU-XLS-R and using MBART as decoder (with adapter-based fine-tuning), the method achieves substantial BLEU gains on CoVoST-2 and Europarl, particularly in high-to-low-resource and zero-shot scenarios. The results demonstrate that semantic-aware representations enable stronger cross-lingual transfer, reducing the transfer gap and enabling robust multilingual speech translation, though they depend on multilingual transcribed data and a semantic text encoder.

Abstract

Research in multilingual speech-to-text translation is topical. Having a single model that supports multiple translation tasks is desirable. The goal of this work is to improve cross-lingual transfer learning in multilingual speech-to-text translation via semantic knowledge distillation. We show that by initializing the encoder of the encoder-decoder sequence-to-sequence translation model with SAMU-XLS-R, a multilingual speech transformer encoder trained using multi-modal (speech-text) semantic knowledge distillation, we achieve significantly better cross-lingual task knowledge transfer than the baseline XLS-R, a multilingual speech transformer encoder trained via self-supervised learning. We demonstrate the effectiveness of our approach on two popular datasets, namely CoVoST-2 and Europarl. On the 21 translation tasks of the CoVoST-2 benchmark, we achieve an average improvement of 12.8 BLEU points over the baselines. In the zero-shot translation scenario, we achieve average gains of 18.8 and 11.9 BLEU points on unseen medium- and low-resource languages, respectively. We make similar observations on the Europarl speech translation benchmark.

Paper Structure (40 sections, 1 equation, 6 figures, 6 tables)

This paper contains 40 sections, 1 equation, 6 figures, 6 tables.

Figures (6)

  • Figure 1: We report translation performance on 21 X$\rightarrow$EN speech-to-text translation tasks in the CoVoST-2 benchmark with different-sized pre-trained XLS-R encoders fine-tuned on labeled speech translation data. The 21 tasks are categorized into high-, mid-, and low-resource tasks depending on the labeled training data available for each task. We report average BLEU-4 scores in the three categories. The key observation is the large performance gap (the cross-lingual transfer gap) between high- and low-resource translation tasks, which this paper addresses.
  • Figure 2: $\tt SAMU\text{-}XLS\text{-}R$ semantic knowledge-distillation framework. The learning framework comprises a speech and a text encoder. The speech encoder transforms a raw speech waveform into an embedding vector. The text encoder transforms the transcript corresponding to the speech utterance into an embedding. The text encoder is initialized using the pre-trained Language-Agnostic BERT Sentence Embedding model $\tt LaBSE$ (Feng et al., 2020). The speech encoder below the pooling layer is initialized using the pre-trained $\tt XLS\text{-}R$ speech encoder (Babu et al., 2021).
  • Figure 3: Number of hours of labeled training data (Y-Axis) for all the 21 X$\rightarrow$EN translation tasks in the CoVoST-2 benchmark.
  • Figure 4: We report average BLEU-4 for the zero-shot X$\rightarrow$EN multilingual speech-to-text translation scenario on the high, mid, and low resource task groups in the CoVoST-2 benchmark. We compare our translation model $\tt SAMU\text{-}XLS\text{-}R$-300M with the similarly sized $\tt XLS\text{-}R$-300M translation model. The translation models are only trained on high-resource groups, while the mid and low-resource groups are unseen during training.
  • Figure 5: Absolute BLEU score improvements using $\tt SAMU\text{-}XLS\text{-}R$-300M over the $\tt XLS\text{-}R$-300M baseline on the 72 X$\rightarrow$Y translation tasks in the Europarl benchmark. The translation models are trained on a subset of 32 translation tasks, corresponding to four source languages; the remaining 40 tasks, corresponding to five source languages, are unseen during training.
  • ...and 1 more figure
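
The Figure 2 caption describes a speech encoder whose frame-level outputs pass through a pooling layer to give one utterance embedding, which is then matched to the text encoder's embedding of the transcript. The following is a minimal sketch of that idea, not the authors' implementation. The caption does not name the pooling operator or the training objective, so mean pooling and a cosine-distance loss are assumptions here, and the function names (`mean_pool`, `cosine_distance`, `distillation_loss`) are hypothetical. Plain lists of floats stand in for XLS-R frame features and for a LaBSE sentence embedding.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_pool(frames: Sequence[Sequence[float]]) -> Vector:
    """Collapse frame-level features (num_frames x dim) into one vector.

    Assumption: mean pooling; the caption only says "pooling layer".
    """
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def distillation_loss(speech_frames: Sequence[Sequence[float]],
                      text_embedding: Sequence[float]) -> float:
    """Distance between the pooled speech embedding and the text embedding.

    Assumption: the text embedding is a fixed target (teacher) and only the
    speech side would be updated to reduce this value.
    """
    speech_embedding = mean_pool(speech_frames)
    return cosine_distance(speech_embedding, text_embedding)


if __name__ == "__main__":
    # Toy stand-ins: two 2-dimensional "frames" and one "text" embedding.
    frames = [[1.0, 0.0], [0.0, 1.0]]
    text = [1.0, 1.0]
    print(distillation_loss(frames, text))
```

In this toy case the pooled speech vector points in the same direction as the text vector, so the loss is approximately zero; an orthogonal pair would give a loss of 1. Under these assumptions, training would push the speech embedding of an utterance toward the text embedding of its transcript, whatever the language.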