When End-to-End is Overkill: Rethinking Cascaded Speech-to-Text Translation

Anna Min, Chenxu Hu, Yi Ren, Hang Zhao

TL;DR

This work reevaluates cascaded S2T translation by diagnosing error propagation as stemming from misalignment between acoustic and semantic spaces. It introduces MC-sslS, which integrates multiple ASR candidates with attention-averaging in MT and fuses self-supervised HuBERT speech representations to preserve linguistic detail. On the GigaST dataset, the approach yields BLEU improvements over standard cascaded systems and approaches end-to-end performance with fast training and no additional parameters. The method demonstrates practical value for leveraging pre-trained ASR/MT models to improve translation accuracy without the data and architectural overhead of fully end-to-end solutions.

Abstract

Though end-to-end speech-to-text translation has been a great success, we argue that the cascaded speech-to-text translation model, which is usually criticized for error propagation between its automatic speech recognition (ASR) and machine translation (MT) models, still has its place. In this paper, we explore the benefits of incorporating multiple ASR candidates and self-supervised speech features into MT. Our analysis reveals that the primary cause of cascading errors is the increased divergence between samples that are similar in the speech domain once they are mapped to the text domain. By including multiple candidates and self-supervised speech features, our approach allows the machine translation model to choose the right words and ensure precise translation using various speech samples. This strategy minimizes error propagation and takes advantage of large ASR and MT datasets, along with pre-trained ASR/MT models, while addressing the associated issues.

Paper Structure

This paper contains 19 sections, 2 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: The correct words are scattered among various candidates, but previous cascaded systems directly select the first candidate. Errors between similarly pronounced words in the ASR output are then propagated through the translation model, causing cascading losses.
  • Figure 2: Overview of our proposed MC-sslS system
  • Figure 3: Computation of the $3^{rd}$ process. After the longest common subsequences are found, the candidates are aligned and padded. The orange circles denote “unk” tokens.
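
The alignment step described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes word-level tokens, aligns only two candidates at a time, and pads the gaps between longest-common-subsequence anchors with an "unk" token. The function names `lcs_pairs` and `align_pair` and the example sentences are hypothetical.

```python
from typing import List, Tuple


def lcs_pairs(a: List[str], b: List[str]) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) of one longest common subsequence of a and b."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of the suffixes a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table from the start to recover the matched positions.
    pairs = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def align_pair(a: List[str], b: List[str], pad: str = "unk"):
    """Align two candidates on their LCS anchors and pad the gaps to equal width."""
    out_a, out_b = [], []
    prev_i = prev_j = 0
    # A sentinel anchor at the end flushes the trailing unmatched segments.
    for i, j in lcs_pairs(a, b) + [(len(a), len(b))]:
        seg_a, seg_b = a[prev_i:i], b[prev_j:j]
        width = max(len(seg_a), len(seg_b))
        out_a += seg_a + [pad] * (width - len(seg_a))
        out_b += seg_b + [pad] * (width - len(seg_b))
        if i < len(a):  # a real anchor, not the sentinel
            out_a.append(a[i])
            out_b.append(b[j])
        prev_i, prev_j = i + 1, j + 1
    return out_a, out_b


cand_1 = "i want to recognize speech".split()
cand_2 = "i want to wreck a nice beach".split()
aligned_1, aligned_2 = align_pair(cand_1, cand_2)
print(aligned_1)
print(aligned_2)
```

After alignment both candidates have the same length, so competing words occupy the same positions and a downstream model can compare them position by position. Extending the sketch to more than two candidates (for example, by aligning each one against the top hypothesis) is a further assumption not covered here.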