Improved Cross-Lingual Transfer Learning For Automatic Speech Translation

Sameer Khurana; Nauman Dawalatabad; Antoine Laurent; Luis Vicente; Pablo Gimeno; Victoria Mingote; James Glass

TL;DR

This work tackles cross-lingual transfer in multilingual speech translation by injecting semantic knowledge into a multilingual speech encoder. It introduces SAMU-XLS-R, a semantic knowledge-distillation framework that aligns speech representations with LaBSE-derived semantics, and expands to 53 languages. By initializing the translation model encoder with SAMU-XLS-R and using MBART as decoder (with adapter-based fine-tuning), the method achieves substantial BLEU gains on CoVoST-2 and Europarl, particularly in high-to-low-resource and zero-shot scenarios. The results demonstrate that semantic-aware representations enable stronger cross-lingual transfer, reducing the transfer gap and enabling robust multilingual speech translation, though they depend on multilingual transcribed data and a semantic text encoder.

Abstract

Research in multilingual speech-to-text translation is topical. Having a single model that supports multiple translation tasks is desirable. The goal of this work is to improve cross-lingual transfer learning in multilingual speech-to-text translation via semantic knowledge distillation. We show that by initializing the encoder of the encoder-decoder sequence-to-sequence translation model with SAMU-XLS-R, a multilingual speech transformer encoder trained using multi-modal (speech-text) semantic knowledge distillation, we achieve significantly better cross-lingual task knowledge transfer than the baseline XLS-R, a multilingual speech transformer encoder trained via self-supervised learning. We demonstrate the effectiveness of our approach on two popular datasets, namely CoVoST-2 and Europarl. On the 21 translation tasks of the CoVoST-2 benchmark, we achieve an average improvement of 12.8 BLEU points over the baselines. In the zero-shot translation scenario, we achieve average gains of 18.8 and 11.9 BLEU points on unseen medium- and low-resource languages, respectively. We make similar observations on the Europarl speech translation benchmark.

Paper Structure (40 sections, 1 equation, 6 figures, 6 tables)

Introduction
Motivation: Cross-Lingual Transfer Gap
Preliminaries
XLS-R ($\tt XLS\text{-}R$)
CNN Feature Extractor
Transformer Encoder
Training Details
SAMU-XLS-R ($\tt SAMU\text{-}XLS\text{-}R$)
The Speech Branch
The Text Branch
Training Details
Method
Expanding SAMU-XLS-R
Translation Model
Overview
...and 25 more sections

Figures (6)

Figure 1: We report translation performance on 21 X$\rightarrow$EN speech-to-text translation tasks in the CoVoST-2 benchmark with differently sized pre-trained XLS-R encoders fine-tuned on labeled speech translation data. The 21 tasks are categorized into high, mid, and low resource tasks depending on the amount of labeled training data available for each task. We report average BLEU-4 scores in the three categories. The key quantity is the performance gap, or cross-lingual transfer gap, between high and low-resource translation tasks. We address this large gap in this paper.
Figure 2: The $\tt SAMU\text{-}XLS\text{-}R$ semantic knowledge-distillation framework. The learning framework comprises a speech and a text encoder. The speech encoder transforms a raw speech waveform into an embedding vector. The text encoder transforms the transcript corresponding to the speech utterance into an embedding. The text encoder is initialized using the pre-trained Language-Agnostic BERT Sentence Embedding model $\tt LaBSE$ (Feng et al., 2020). The speech encoder below the pooling layer is initialized using the pre-trained $\tt XLS\text{-}R$ speech encoder (Babu et al., 2021).
Figure 3: Number of hours of labeled training data (Y-Axis) for all the 21 X$\rightarrow$EN translation tasks in the CoVoST-2 benchmark.
Figure 4: We report average BLEU-4 for the zero-shot X$\rightarrow$EN multilingual speech-to-text translation scenario on the high, mid, and low resource task groups in the CoVoST-2 benchmark. We compare our translation model $\tt SAMU\text{-}XLS\text{-}R$-300M with the similarly sized $\tt XLS\text{-}R$-300M translation model. The translation models are only trained on high-resource groups, while the mid and low-resource groups are unseen during training.
Figure 5: Absolute BLEU score improvements using $\tt SAMU\text{-}XLS\text{-}R$-300M over the $\tt XLS\text{-}R$-300M baseline on the 72 X$\rightarrow$Y translation tasks in the Europarl benchmark. The translation models are trained on a subset of 32 translation tasks, corresponding to four source languages, while the remaining 40 tasks, corresponding to five source languages, are unseen during training.
...and 1 more figures
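
The two-branch framework described in the Figure 2 caption can be illustrated with a minimal sketch. The speech branch pools frame-level features into one utterance embedding, which is then pulled toward the text embedding of the transcript. The captions here do not state the pooling operation or the training objective, so the mean pooling and the cosine-distance loss below are assumptions, and the function names (`mean_pool`, `cosine_distance`, `distillation_loss`) are hypothetical. In practice the frame vectors would come from the XLS-R encoder and the text embedding from LaBSE; plain lists stand in for both here.

```python
import math


def mean_pool(frames):
    """Average a list of frame vectors into one utterance-level vector.

    Assumption: the caption only mentions "a pooling layer"; mean pooling
    is used here purely for illustration.
    """
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distillation_loss(speech_frames, text_embedding):
    """Distance between the pooled speech embedding and the text embedding.

    Assumption: cosine distance is one plausible alignment objective;
    the source text does not specify the loss.
    """
    speech_embedding = mean_pool(speech_frames)
    return cosine_distance(speech_embedding, text_embedding)


# Toy inputs: two 2-dimensional "frames" and a 2-dimensional "text embedding".
frames = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = distillation_loss(frames, [1.0, 1.0])
orthogonal_loss = distillation_loss(frames, [1.0, -1.0])
```

With these toy inputs the pooled speech vector points in the same direction as the first text vector, so the loss is close to zero, while the second text vector is orthogonal to it and gives a loss of one. Minimizing such a loss over paired speech and transcripts is what would move the speech encoder's output toward the text encoder's semantic space.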
