FASST: Fast LLM-based Simultaneous Speech Translation
Siqi Ouyang, Xi Xu, Chinmay Dandekar, Lei Li
TL;DR
FASST reduces latency in simultaneous speech translation (SST) while maintaining translation quality by combining a blockwise-causal speech encoder with an incremental LLM decoding process guided by a consistency mask. A two-stage training pipeline first aligns speech embeddings with LLM embeddings using a word-aligned contrastive loss, then finetunes the full model under a wait-$k$-stride-$n$ policy, so that streaming input can be processed without full re-encoding. On MuST-C, FASST achieves the best quality-latency trade-off among the compared models, outperforming the previous best model by an average of 1.5 BLEU on English-to-Spanish (En-Es) at the same latency, and it remains robust across data regimes and policy variants. The results suggest that encoding-heavy policies benefit most from incremental encoding and the consistency mask, and that the approach may carry over to other policies.
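As an illustration of the wait-$k$-stride-$n$ policy named above, here is a minimal sketch of the read/write schedule. It assumes the common reading of the policy: read $k$ speech segments first, then alternate between writing $n$ target tokens and reading $n$ more segments, and write the remainder once the input is exhausted. The function name `wait_k_stride_n` and the fixed `target_len` are hypothetical simplifications, since a real system stops on an end-of-sequence token rather than a known length.

```python
from typing import List, Tuple


def wait_k_stride_n(num_segments: int, k: int, n: int,
                    target_len: int) -> List[Tuple[str, int]]:
    """Return a list of ("READ", count) / ("WRITE", count) actions.

    Assumption: target_len is known in advance, for illustration only.
    """
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive")

    # Initial wait: read the first k segments (or all of them if fewer).
    read = min(k, num_segments)
    actions = [("READ", read)]
    written = 0

    while written < target_len:
        if read < num_segments:
            # Source not exhausted: write up to n tokens, then read up to n segments.
            w = min(n, target_len - written)
            actions.append(("WRITE", w))
            written += w
            if written < target_len:
                r = min(n, num_segments - read)
                actions.append(("READ", r))
                read += r
        else:
            # Source exhausted: write everything that remains.
            actions.append(("WRITE", target_len - written))
            written = target_len
    return actions


schedule = wait_k_stride_n(num_segments=5, k=2, n=2, target_len=7)
print(schedule)
```

With 5 segments, $k=2$, $n=2$, and 7 target tokens, the schedule reads 2, writes 2, reads 2, writes 2, reads the last segment, and then writes the remaining 3 tokens.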
Abstract
Simultaneous speech translation (SST) takes streaming speech input and generates text translation on the fly. Existing methods either have high latency due to recomputation of input representations, or fall behind offline ST in translation quality. In this paper, we propose FASST, a fast large language model (LLM)-based method for streaming speech translation. We propose blockwise-causal speech encoding and a consistency mask, so that streaming speech input can be encoded incrementally without recomputation. Furthermore, we develop a two-stage training strategy to optimize FASST for simultaneous inference. We evaluate FASST and multiple strong prior models on the MuST-C dataset. Experimental results show that FASST achieves the best quality-latency trade-off. It outperforms the previous best model by an average of 1.5 BLEU under the same latency for English-to-Spanish translation.
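To make the idea of blockwise-causal encoding concrete, the following is a minimal sketch of a blockwise-causal attention mask. It assumes that speech frames are grouped into fixed-size blocks and that each frame may attend to every frame in its own block and in earlier blocks, but to none in later blocks. The function name and the `block_size` parameter are hypothetical, and the sketch does not cover the consistency mask used on the LLM side.

```python
from typing import List


def blockwise_causal_mask(num_frames: int, block_size: int) -> List[List[bool]]:
    """mask[i][j] is True when frame i is allowed to attend to frame j."""
    if block_size < 1:
        raise ValueError("block_size must be positive")
    return [
        # Frame j is visible to frame i if j's block is not later than i's block.
        [(j // block_size) <= (i // block_size) for j in range(num_frames)]
        for i in range(num_frames)
    ]


mask_4 = blockwise_causal_mask(num_frames=4, block_size=2)
mask_6 = blockwise_causal_mask(num_frames=6, block_size=2)

# Rows for the first 4 frames, restricted to the first 4 columns, are unchanged
# after 2 more frames (one new block) arrive.
prefix_of_6 = [row[:4] for row in mask_6[:4]]
print(prefix_of_6 == mask_4)  # True
```

Because earlier frames never attend to later blocks, their attention pattern does not change when a new block arrives. Under this assumption their representations can be cached and only the new block needs to be encoded, which is the property that incremental encoding without recomputation relies on.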
