Strategies for improving low resource speech to text translation relying on pre-trained ASR models

Santosh Kesiraju, Marek Sarvas, Tomas Pavlicek, Cecile Macaire, Alejandro Ciuba

TL;DR

The paper aims to improve low-resource speech translation (ST) by initializing from pre-trained multilingual ASR models and adding a CTC-based auxiliary objective within an encoder-decoder framework. It formalizes training with $L_{asr}=\lambda L_{ctc}+(1-\lambda)L_{att}$ and $L_{st}=\alpha L_{ctc}+(1-\alpha)L_{att}$, and uses joint decoding with $\hat{\mathbf{z}}=\arg\max_{\mathbf{z}}\beta\log p_{ctc}(\mathbf{z}|\mathbf{x})+(1-\beta)\log p_{att}(\mathbf{z}|\mathbf{x})$, where $\lambda$, $\alpha$, and $\beta$ are interpolation weights. The key finding is that a multilingual ASR model pre-trained on about 300 hours of data can achieve strong ST results (e.g., 7.3 BLEU on Tamasheq→French, +1.6 BLEU over IWSLT'22), with CTC objectives consistently improving performance across initializations and decoding schemes. The work demonstrates the viability of multilingual ASR initialization for low-resource ST and motivates future exploration of multilingual ST fine-tuning and representation alignment to broaden practical impact for underserved languages.
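
The interpolated objective and the joint decoding rule above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the loss values, and the candidate hypotheses with their log-probabilities are hypothetical, and a real system would score hypotheses inside beam search rather than over a fixed list.

```python
def interpolate(ctc_term: float, att_term: float, weight: float) -> float:
    """Return weight * ctc_term + (1 - weight) * att_term.

    The same convex combination serves for the training losses
    (weight = lambda or alpha) and the decoding score (weight = beta).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * ctc_term + (1.0 - weight) * att_term


def joint_decode(hypotheses, beta: float) -> str:
    """Pick the hypothesis maximizing beta*log p_ctc + (1 - beta)*log p_att.

    `hypotheses` is a list of (text, log_p_ctc, log_p_att) tuples.
    Assumption: the candidates are already given; no search is performed.
    """
    best = max(hypotheses, key=lambda h: interpolate(h[1], h[2], beta))
    return best[0]


# Hypothetical per-batch loss values, combined with alpha = 0.3.
st_loss = interpolate(ctc_term=2.0, att_term=1.0, weight=0.3)

# Hypothetical candidates: (text, log p_ctc, log p_att).
candidates = [
    ("bonjour le monde", -4.0, -1.0),
    ("bonjour monde", -1.5, -2.0),
]
best_att_only = joint_decode(candidates, beta=0.0)  # attention score alone
best_joint = joint_decode(candidates, beta=0.5)     # equal weighting
```

With these made-up scores, the attention-only choice and the jointly scored choice differ, which shows how the CTC term can change the selected translation.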

Abstract

This paper presents techniques and findings for improving the performance of low-resource speech to text translation (ST). We conducted experiments on both simulated and real low-resource setups, on the language pairs English - Portuguese and Tamasheq - French, respectively. Using the encoder-decoder framework for ST, our results show that a multilingual automatic speech recognition system acts as a good initialization under low-resource scenarios. Furthermore, using CTC as an additional objective for translation during training and decoding helps to reorder the internal representations and improves the final translation. Through our experiments, we try to identify the factors (initializations, objectives, and hyper-parameters) that contribute the most to improvements in low-resource setups. With only 300 hours of pre-training data, our model achieved a BLEU score of 7.3 on Tamasheq - French data, outperforming prior published works from IWSLT 2022 by 1.6 points.

Paper Structure (13 sections, 3 equations, 3 figures, 5 tables)


Figures (3)

  • Figure 1: Cascaded and end-to-end frameworks for speech translation. $\mathbf{x}$ is the input speech (features), $\mathbf{y}$ is the corresponding text transcriptions, and $\mathbf{z}$ is the target text translations. $\mathbf{h}$ is the hidden representation from ASR that establishes the continuous path between ASR and MT models. The ASR, MT, encoder and decoder modules can be initialized from various kinds of pre-trained models.
  • Figure 2: Performance of ST systems on taq $\rightarrow$ fr dataset, relying on various initialization, fine-tuning and decoding schemes.
  • Figure 3: Effect of various initializations and amounts of ST fine-tuning data.