SimulMEGA: MoE Routers are Advanced Policy Makers for Simultaneous Speech Translation

Chenyang Le, Bing Han, Jinshun Li, Songyong Chen, Yanmin Qian

TL;DR

SimulMEGA presents an unsupervised policy-learning framework for simultaneous translation that leverages a Mixture-of-Experts refiner and a global routing gate to learn read/write decisions without adding inference-time cost. By integrating a prefix-based training regime and a dual-expert architecture with a leakage-free previous-output attention mechanism, the approach maintains high translation quality while achieving low latency across multilingual many-to-many S2TT and S2ST tasks, and extends to streaming TTS on CosyVoice2 backbones. The method demonstrates state-of-the-art quality-latency performance across six languages, with BLEU degradation remaining under 7% at 1.5 seconds average lag and under 3% at 3 seconds, and shows robust generalization and versatile streaming capabilities. These results highlight SimulMEGA's potential as a broadly applicable, low-overhead solution for real-time multilingual translation and dialogue systems.

Abstract

Simultaneous Speech Translation (SimulST) enables real-time cross-lingual communication by jointly optimizing speech recognition and machine translation under strict latency constraints. Existing systems struggle to balance translation quality, latency, and semantic coherence, particularly in multilingual many-to-many scenarios where divergent read and write policies hinder unified strategy learning. In this paper, we present SimulMEGA (Simultaneous Generation by Mixture-of-Experts Gating), an unsupervised policy-learning framework that combines prefix-based training with a Mixture-of-Experts refiner to learn effective read and write decisions implicitly, without adding inference-time overhead. Our design requires only minimal modifications to standard transformer architectures and generalizes across both speech-to-text and text-to-speech streaming tasks. In a comprehensive evaluation on six language pairs, our 500M-parameter speech-to-text model outperforms the Seamless baseline, achieving under 7 percent BLEU degradation at 1.5 seconds average lag and under 3 percent at 3 seconds. We further demonstrate the versatility of SimulMEGA by extending it to streaming TTS with a unidirectional backbone, yielding superior latency-quality tradeoffs.

Paper Structure

This paper contains 43 sections, 8 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Overview of the training and inference paradigm of SimulMEGA. SimulMEGA is composed of a Streaming Speech Encoder, a text decoder, an MoE routing gate (router), and an MoE refiner. In the first stage, the model is pre-trained on $\mathcal{L}^\mathrm{offline}$. In the second stage, the model is trained on a combination of $\mathcal{L}^\mathrm{offline}$, $\mathcal{L}^\mathrm{prefix}$, and $\mathcal{L}^\mathrm{refiner}$. SSE denotes Streaming Speech Encoder.
  • Figure 2: (a) Structure and inference example of the streaming speech encoder of SimulMEGA. It comprises 20 chunkwise autoregressive (Chunk-AR) blocks and 4 non-autoregressive (NAR) blocks. After each read, the Chunk-AR blocks compute only the new chunk, while the NAR blocks recompute the whole sequence. An End-of-Stream (EoSt) flag is given before the NAR blocks. (b) The structure of the MoE refiner, in which the gate decides the mixture proportion of the Prefix Expert and the Global Expert. This proportion reflects the model's confidence in the prefix sequence, leading to a natural read/write policy. The self-attention module is replaced by a previous-output attention (POA) module to prevent global information leakage.
  • Figure 3: Multilingual simultaneous speech-to-text translation quality (BLEU) against the latency metric (LAAL) on different test sets. The reported BLEU and LAAL are averages over all language splits in each test set (5 in CoVoST X-EN, 2 in CoVoST2 EN-X, and 30 in Fleurs X-X).
  • Figure 4: Bidirectional simultaneous S2TT and S2ST results between Mandarin and English on the CoVoST2 test set. The translation quality metric is BLEU for S2TT and ASR-BLEU for S2ST. We use the same threshold group for S2ST and S2TT. Results for NAST-S2S and StreamSpeech are taken from the original papers.
  • Figure 5: Ablation studies. The thresholds used in all ablation evaluations are 0.8, 0.7, ..., 0.2 from left to right.
  • ...and 4 more figures
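
The Figure 2 and Figure 5 captions describe a gate that sets the mixture proportion between the Prefix Expert and the Global Expert, and a threshold swept from 0.8 down to 0.2. The following is a minimal Python sketch of how such a gate could drive a read/write decision. It is not the authors' implementation: the function names (`mix_experts`, `decide`, `simulate`), the reading of the gate as the weight on the Prefix Expert, and the rule "write when the gate is at or above the threshold" are all assumptions made for illustration, and the paper may orient the gate or the comparison the other way.

```python
from typing import List, Sequence


def mix_experts(prefix_out: Sequence[float],
                global_out: Sequence[float],
                gate: float) -> List[float]:
    """Blend two expert outputs with a scalar gate in [0, 1].

    Assumption: `gate` is the weight on the prefix expert and
    (1 - gate) is the weight on the global expert.
    """
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    if len(prefix_out) != len(global_out):
        raise ValueError("expert outputs must have the same length")
    return [gate * p + (1.0 - gate) * g
            for p, g in zip(prefix_out, global_out)]


def decide(gate: float, threshold: float) -> str:
    """Hypothetical policy: WRITE when the prefix is trusted enough,
    otherwise READ more source speech."""
    return "WRITE" if gate >= threshold else "READ"


def simulate(gate_values: Sequence[float], threshold: float) -> List[str]:
    """Apply the hypothetical policy to a sequence of gate values."""
    return [decide(g, threshold) for g in gate_values]


if __name__ == "__main__":
    # Made-up gate values, for illustration only.
    gates = [0.1, 0.5, 0.9]
    for threshold in (0.8, 0.5, 0.2):
        print(threshold, simulate(gates, threshold))
```

Under the convention assumed in this sketch, a lower threshold makes the policy write earlier, so sweeping the threshold traces out different quality-latency operating points. Only the scalar gate is consulted in `decide`, which is one way to read the claim that the policy adds no inference-time overhead.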