SimulTron: On-Device Simultaneous Speech to Speech Translation

Alex Agranovich; Eliya Nachmani; Oleg Rybakov; Yifan Ding; Ye Jia; Nadav Bar; Heiga Zen; Michelle Tadmor Ramanovich

TL;DR

SimulTron targets real-time, on-device simultaneous speech-to-speech translation (S2ST) by extending the Translatotron lineage with a causal streaming encoder, a wait-$k$ attention-based decoder, and a streaming vocoder. The architecture supports adjustable latency via a fixed delay and is shown running on a Pixel 7 Pro. On MuST-C it achieves better BLEU and latency than the prior real-time S2ST method, and it also surpasses offline Translatotron baselines in several settings. Key findings include a BLEU improvement over Translatotron 1 in real-time Spanish–English translation, substantial BLEU gains in offline settings, and a clear latency-accuracy trade-off as the waiting parameter $k$ is varied. The work advances practical, privacy-preserving S2ST on mobile devices and lays the groundwork for broader multilingual on-device translation with further vocoder and hardware optimizations.

Abstract

Simultaneous speech-to-speech translation (S2ST) holds the promise of breaking down communication barriers and enabling fluid conversations across languages. However, achieving accurate, real-time translation on mobile devices remains a major challenge. We introduce SimulTron, a novel S2ST architecture designed to tackle this task. SimulTron is a lightweight direct S2ST model that builds on the strengths of the Translatotron framework while incorporating key modifications for streaming operation and an adjustable fixed delay. Our experiments show that SimulTron surpasses Translatotron 2 in offline evaluations. Furthermore, real-time evaluations reveal that SimulTron improves upon the performance achieved by Translatotron 1. Additionally, SimulTron achieves superior BLEU scores and latency compared to the previous real-time S2ST method on the MuST-C dataset. Significantly, we have successfully deployed SimulTron on a Pixel 7 Pro device, showing its potential for simultaneous S2ST on-device.

Paper Structure (13 sections, 2 figures, 5 tables)

This paper contains 13 sections, 2 figures, 5 tables.

Introduction
Related Work
Offline speech-to-speech translation
Real time speech-to-speech translation
Model Architecture
Streaming Encoder
Streaming Decoder and Vocoder
Real-time Inference
Experiments and Results
Results
Conversational dataset
MuST-C dataset
Conclusion

Figures (2)

Figure 1: An overview of the proposed SimulTron architecture. First, the streaming encoder generates a compact representation of the source language input. Subsequently, the decoder, employing wait-k attention, produces a mel-spectrogram representation of the target translation. The MelGAN vocoder then synthesizes the final translated speech output from the mel-spectrogram.
Figure 2: The latency attributed to each model component (Encoder, Decoder+Vocoder), with a line at the 25-millisecond threshold marking the practical real-time operational limit.
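
The wait-k attention named in the Figure 1 caption can be illustrated with a small visibility mask. The sketch below assumes the standard wait-k schedule from simultaneous text translation, in which target step t (0-indexed) may attend to the first k + t source steps. The function name `wait_k_mask` and the one-source-step-per-target-step alignment are illustrative assumptions, not details taken from the paper; SimulTron's actual frame rates and attention implementation may differ.

```python
def wait_k_mask(num_target_steps, num_source_steps, k):
    """Return mask[t][s] == True where target step t may attend to source step s.

    Assumption (not from the paper): the standard wait-k schedule, where
    target step t (0-indexed) sees the first k + t source steps, capped at
    the source length.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    mask = []
    for t in range(num_target_steps):
        visible = min(k + t, num_source_steps)  # source steps available so far
        mask.append([s < visible for s in range(num_source_steps)])
    return mask


if __name__ == "__main__":
    # Print one row per target step; "x" marks a visible source step.
    for row in wait_k_mask(num_target_steps=4, num_source_steps=6, k=2):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this schedule, a larger k lets each output step see more source context before it is emitted, which is the latency-accuracy trade-off the TL;DR describes.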
