Translatotron 3: Speech to Speech Translation with Monolingual Data

Eliya Nachmani, Alon Levkovitch, Yifan Ding, Chulayuth Asawaroengchai, Heiga Zen, Michelle Tadmor Ramanovich

TL;DR

Translatotron 3 tackles unsupervised direct speech-to-speech translation by learning from monolingual data with a shared encoder and two decoders. It combines a masked autoencoder pretraining scheme, unsupervised MUSE embedding alignment, and back-translation across two language directions, enabling end-to-end S2ST without bilingual speech data. The approach achieves large BLEU gains over a cascade baseline on synthesized and real data, and demonstrates notable preservation of para-/non-linguistic cues like pauses, speaking rate, and speaker identity, narrowing gaps to supervised systems on CVSS English–Spanish. This work suggests a practical path for high-quality S2ST in low-resource settings where bilingual speech data are scarce.

Abstract

This paper presents Translatotron 3, a novel approach to unsupervised direct speech-to-speech translation from monolingual speech-text datasets that combines a masked autoencoder, unsupervised embedding mapping, and back-translation. Experimental results on speech-to-speech translation tasks between Spanish and English show that Translatotron 3 outperforms a baseline cascade system, with an improvement of 18.14 BLEU points on the synthesized Unpaired-Conversational dataset. In contrast to supervised approaches, which require real paired data or specialized modeling to replicate para-/non-linguistic information such as pauses, speaking rates, and speaker identity, Translatotron 3 is able to retain this information. Audio samples can be found at http://google-research.github.io/lingvo-lab/translatotron3

Paper Structure

This paper contains 18 sections, 8 equations, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: The two training phases in the proposed approach. (i) Phase 1 uses the reconstruction loss via the auto-encoding path. (ii) Phase 2 employs the reconstruction loss via back-translation.
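
The data flow of the two phases in Figure 1 can be illustrated with a toy sketch. This is not the authors' implementation: the linear "encoder" and "decoders", the feature sizes, the mean-squared-error loss, and all function names are hypothetical stand-ins, and the masking and embedding-alignment components are omitted. It only shows which path each reconstruction loss is computed along, assuming one shared encoder and one decoder per language.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT, HID = 8, 4  # hypothetical feature-frame and latent sizes

# Toy linear stand-ins for the shared encoder and the two language decoders.
W_enc = rng.normal(size=(FEAT, HID))
W_dec = {"en": rng.normal(size=(HID, FEAT)), "es": rng.normal(size=(HID, FEAT))}


def encode(frames):
    """Shared encoder: maps frames of either language to a latent sequence."""
    return frames @ W_enc


def decode(latent, lang):
    """Language-specific decoder: maps latents back to frames in `lang`."""
    return latent @ W_dec[lang]


def mse(a, b):
    """Reconstruction loss (mean squared error, chosen here for simplicity)."""
    return float(np.mean((a - b) ** 2))


def phase1_loss(frames, lang):
    """Phase 1, auto-encoding path: encode, then decode into the same language."""
    return mse(decode(encode(frames), lang), frames)


def phase2_loss(frames, src, tgt):
    """Phase 2, back-translation path: src -> pseudo tgt -> back to src."""
    pseudo_tgt = decode(encode(frames), tgt)         # pseudo-translation
    reconstructed = decode(encode(pseudo_tgt), src)  # translate back
    return mse(reconstructed, frames)


en_frames = rng.normal(size=(5, FEAT))  # 5 frames of toy "English" features
print("phase 1 loss:", phase1_loss(en_frames, "en"))
print("phase 2 loss:", phase2_loss(en_frames, "en", "es"))
```

In both phases the target of the loss is the original monolingual input, which is why no paired bilingual speech is needed in this sketch.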