Unveiling the Role of Pretraining in Direct Speech Translation

Belen Alastruey; Gerard I. Gállego; Marta R. Costa-jussà

TL;DR

A subtle change to the decoder cross-attention is proposed so that source information is integrated from earlier steps in training. With this change, a model trained from scratch achieves performance comparable to the pretrained one while reducing training time.

Abstract

Direct speech-to-text translation systems face an important drawback: data scarcity. A common solution consists of pretraining the encoder on automatic speech recognition, which makes the training process less efficient. In this study, we compare the training dynamics of a system using a pretrained encoder, the conventional approach, and one trained from scratch. We observe that, throughout training, the randomly initialized model struggles to incorporate information from the speech inputs into its predictions. We hypothesize that this issue stems from the difficulty of effectively training an encoder for direct speech translation. While a model trained from scratch needs to learn acoustic and semantic modeling simultaneously, a pretrained one can focus on the latter alone. Based on these findings, we propose a subtle change in the decoder cross-attention to integrate source information from earlier steps in training. We show that with this change, the model trained from scratch can achieve performance comparable to the pretrained one while reducing the training time.

Paper Structure (17 sections, 2 equations, 2 figures, 2 tables)

Introduction
Related Work
Interpretability of Transformer Models
Training Dynamics on Machine Translation
Target-side language modeling:
Learning how to use source:
Refining translations:
Training Dynamics in Speech Translation
Results Analysis
Training ST from Scratch
WeRC: Weighted Residual Connection
Ablation Study
Conclusions
Experimental Setup
ST Models Details:
...and 2 more sections

Figures (2)

Figure 1: Source contribution ($\pm$std) and BLEU along the full training (right) and along the first 10k updates (left).
Figure 2: Standard S2T-Transformer cross-attention layer (left) and proposed WeRC (right).
