Speech to Speech Translation with Translatotron: A State of the Art Review

Jules R. Kala, Emmanuel Adetiba, Abdultaofeek Abayom, Oluwatobi E. Dare, Ayodele H. Ifijeh

TL;DR

The paper tackles the latency and error-compounding limitations of cascade speech-to-speech translation (S2ST) by evaluating direct S2ST models, focusing on Google’s Translatotron series. It traces the evolution from Translatotron 1, a proof of concept that lagged behind cascade systems, to Translatotron 2, which matches cascade performance with end-to-end training, and finally to Translatotron 3, which surpasses cascade systems under certain conditions by using unsupervised learning and minimizing the need for parallel data. It also surveys S2ST corpora and discusses the implications for African languages, arguing that Translatotron 3’s unsupervised, speaker-preserving approach is well suited to English–Yoruba applications in medical contexts. Overall, the work positions direct S2ST as a viable path to lower latency and better robustness, with practical impact for low-resource languages and multilingual healthcare.

Abstract

Cascade-based speech-to-speech translation has long been considered the benchmark, but it suffers from several issues, notably the time taken to translate speech from one language to another and compounding errors. These issues arise because a cascade chains several components: speech recognition, text-to-text translation, and finally text-to-speech synthesis. Translatotron, a sequence-to-sequence direct speech-to-speech translation model, was designed by Google to address the compounding errors associated with the cascade model. There are currently three versions of the model: Translatotron 1, Translatotron 2, and Translatotron 3. The first version was designed as a proof of concept to show that direct speech-to-speech translation was possible; it was less effective than the cascade model but produced promising results. Translatotron 2 improved on Translatotron 1, with results similar to those of the cascade model. Translatotron 3, the latest version, outperforms the cascade model in some respects. This paper presents a complete review of speech-to-speech translation, with a particular focus on all versions of the Translatotron model. We also show that Translatotron is the best model to bridge the language gap between African languages and other well-formalized languages.

Paper Structure

This paper contains 6 sections, 9 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Translatotron 1 architecture [jia]
  • Figure 2: Translatotron 2 architecture and training method [jia2]
  • Figure 3: Translatotron 3 architecture and training process [nachmani]
  • Figure 4: Speech-to-speech translation evolution tree