Improved Cross-Lingual Transfer Learning For Automatic Speech Translation
Sameer Khurana, Nauman Dawalatabad, Antoine Laurent, Luis Vicente, Pablo Gimeno, Victoria Mingote, James Glass
TL;DR
This work tackles cross-lingual transfer in multilingual speech translation by injecting semantic knowledge into a multilingual speech encoder. It introduces SAMU-XLS-R, a semantic knowledge-distillation framework that aligns speech representations with LaBSE-derived semantics, and expands coverage to 53 languages. Initializing the translation model's encoder with SAMU-XLS-R and using MBART as the decoder (with adapter-based fine-tuning) yields substantial BLEU gains on CoVoST-2 and Europarl, particularly in high-to-low-resource and zero-shot scenarios. The results show that semantic-aware representations enable stronger cross-lingual transfer, narrowing the transfer gap and supporting robust multilingual speech translation, though the approach depends on multilingual transcribed data and a semantic text encoder.
Abstract
Multilingual speech-to-text translation is an active research area, and a single model that supports multiple translation tasks is desirable. The goal of this work is to improve cross-lingual transfer learning in multilingual speech-to-text translation via semantic knowledge distillation. We show that by initializing the encoder of the encoder-decoder sequence-to-sequence translation model with SAMU-XLS-R, a multilingual speech transformer encoder trained using multi-modal (speech-text) semantic knowledge distillation, we achieve significantly better cross-lingual task knowledge transfer than the baseline XLS-R, a multilingual speech transformer encoder trained via self-supervised learning. We demonstrate the effectiveness of our approach on two popular datasets, namely CoVoST-2 and Europarl. On the 21 translation tasks of the CoVoST-2 benchmark, we achieve an average improvement of 12.8 BLEU points over the baselines. In the zero-shot translation scenario, we achieve average gains of 18.8 and 11.9 BLEU points on unseen medium- and low-resource languages, respectively. We make similar observations on the Europarl speech translation benchmark.
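The abstract does not spell out the distillation objective, so the following is only a minimal sketch of what speech-text semantic knowledge distillation could look like. It assumes, hypothetically, that frame-level speech encoder outputs are mean-pooled into one utterance vector and pulled toward a fixed sentence embedding of the transcript (the teacher, e.g. from a text encoder such as LaBSE) with a cosine-distance loss. The function names, the pooling choice, and the loss are illustrative assumptions, not the authors' confirmed implementation.

```python
import math

def mean_pool(frames):
    """Average frame-level vectors into one utterance-level vector (assumed pooling)."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]

def cosine_distance(u, v):
    """Return 1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def distillation_loss(speech_frames, text_embedding):
    """Hypothetical student-teacher loss: the pooled speech embedding (student)
    is compared with the transcript's sentence embedding (teacher)."""
    return cosine_distance(mean_pool(speech_frames), text_embedding)

# Toy example with 2-dimensional vectors standing in for real embeddings.
aligned = distillation_loss([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
orthogonal = distillation_loss([[1.0, 0.0]], [0.0, 1.0])
print(aligned, orthogonal)
```

In this toy setup the loss is near zero when the pooled speech vector points in the same direction as the teacher embedding and grows as the two diverge. In a real system the frames would come from the speech encoder and only the speech encoder's parameters would be updated.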
