Transformer-Encoder Trees for Efficient Multilingual Machine Translation and Speech Translation

Yiwen Guan, Jacob Whitehill

Abstract

Multilingual translation suffers from computational redundancy, especially when translating into multiple languages simultaneously. In addition, translation quality can suffer for low-resource languages. To address both problems, we introduce the Transformer-Encoder Tree (TET), a hierarchical, non-autoregressive, encoder-only architecture trained with Connectionist Temporal Classification (CTC) for multilingual translation. TET shares intermediate representations among linguistically similar target languages, which improves accuracy on low-resource languages, reduces computational redundancy, and allows all target languages to be generated in a single forward pass. TET eliminates the sequential bottleneck of autoregressive models and supports fully parallel decoding of all tokens across all target languages. Compared to a naive one-to-many multilingual design, TET reduces the total parameter count by 66% and lowers inference computation by 60%. In speech translation, combining TET with a non-autoregressive speech recognition backbone (Wav2Vec2) achieves translation quality competitive with autoregressive systems while speeding up inference by a factor of approximately 7 to 14.
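
To make the CTC-based, non-autoregressive decoding mentioned in the abstract concrete, the following is a minimal sketch of the standard CTC greedy collapse rule: take the most likely label at every output position independently, merge consecutive repeats, then drop blanks. The function name `ctc_greedy_collapse`, the blank index `0`, and the toy label ids are assumptions made for illustration; the paper's own decoding code is not shown in this text.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (illustrative choice)


def ctc_greedy_collapse(frame_ids, blank=BLANK):
    """Collapse a per-position label sequence with the standard CTC rule.

    frame_ids: one label id per output position, e.g. the argmax of the
    model's logits at each position (computed for all positions in parallel).
    """
    # Step 1: merge runs of identical consecutive labels.
    merged = [label for label, _ in groupby(frame_ids)]
    # Step 2: remove blank symbols.
    return [label for label in merged if label != blank]


# Toy example: the blank between the two 3s keeps them as separate tokens.
tokens = ctc_greedy_collapse([0, 3, 3, 0, 3, 5, 5, 0])
```

Because each position's label is chosen without conditioning on previously emitted tokens, the argmax step has no sequential dependency, which is the property that makes this kind of decoding parallel.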

Paper Structure

This paper contains 29 sections, 3 equations, 3 figures, and 10 tables.

Figures (3)

  • Figure 1: Embedding clusters for the last 3 layers of one EN-X multilingual translation model, a Transformer encoder-only model trained on 8 target languages, with a task token representing the language prepended to the input. The embeddings are averaged over the same 5792 sentences from Multi30K and cluster according to linguistic similarity.
  • Figure 2: Illustration of the MT/ST pipeline using the Transformer-Encoder Tree (TET); the red box shows the general architecture of TET for the Indo-European family. Each node in the tree represents one Transformer-encoder layer. An English sentence (or a transcript generated by the ASR module) is passed to TET for processing; then either (1) sentences in all target languages are generated for the MT task, or (2) speech tokens for all target languages are generated for the S2ST task. The components drawn with dashed lines are for speech translation tasks and are optional in the TET pipeline; we use Wav2Vec2 in the NAR ASR module and HiFi-GAN as the NAR TTS vocoder.
  • Figure 3: BLEU distribution of 101 tree topologies trained for 10 epochs on Multi30K. The highest and lowest BLEU scores among the 101 models are 26.20 and 20.23, respectively. TET (marked) ranks 4th among all trees, with a BLEU score of 25.06.
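
The tree structure described in the Figure 2 caption, where each node is one Transformer-encoder layer and every target language is a leaf, can be sketched in a few lines. The topology below (`TREE`), the group names, the language codes, and the stand-in `toy_layer` are all hypothetical and chosen only for illustration; they are not the tree or the layers used in the paper. The sketch shows how one traversal produces an output for every leaf while evaluating each shared node only once.

```python
# Hypothetical toy topology: each key is one layer (node); leaves are languages.
TREE = {
    "root": {
        "germanic": {"de": {}, "nl": {}},
        "romance": {"fr": {}, "es": {}, "it": {}},
    }
}


def count_nodes(tree):
    """Number of layers evaluated when intermediate nodes are shared."""
    return sum(1 + count_nodes(children) for children in tree.values())


def leaf_paths(tree, prefix=()):
    """Yield the root-to-leaf path of layer names for every target language."""
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            yield from leaf_paths(children, path)
        else:
            yield path


def forward(tree, hidden, layer_fn, outputs=None):
    """Single pass over the tree: each node's layer runs exactly once."""
    outputs = {} if outputs is None else outputs
    for name, children in tree.items():
        h = layer_fn(name, hidden)
        if children:
            forward(children, h, layer_fn, outputs)
        else:
            outputs[name] = h  # leaf output for one target language
    return outputs


calls = []


def toy_layer(name, hidden):
    """Stand-in for a Transformer-encoder layer: records the layers applied."""
    calls.append(name)
    return hidden + (name,)


outputs = forward(TREE, (), toy_layer)
shared_evals = count_nodes(TREE)                       # layers run with sharing
unshared_evals = sum(len(p) for p in leaf_paths(TREE))  # one full stack per language
```

In this toy tree, sharing evaluates 8 layers in total, whereas running a separate three-layer stack per language would evaluate 15. These counts depend entirely on the illustrative topology and are not the paper's reported parameter or computation savings.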