FASST: Fast LLM-based Simultaneous Speech Translation

Siqi Ouyang, Xi Xu, Chinmay Dandekar, Lei Li

TL;DR

FASST reduces latency in simultaneous speech translation while maintaining translation quality by introducing a blockwise-causal speech encoder and an incremental LLM decoding process guided by a consistency mask. A two-stage training pipeline first aligns speech embeddings with LLM embeddings using a word-aligned contrastive loss, then finetunes the full model under a wait-$k$-stride-$n$ policy, enabling efficient streaming without full re-encoding. Experimental results on MuST-C show that FASST achieves a superior quality-latency trade-off, outperforming strong baselines by about 1.5 BLEU on En-Es at matched latency, and remains robust across data regimes and policy variants. The work highlights practical gains for real-time, LLM-based SST and suggests that encoding-heavy policies benefit most from incremental encoding and the consistency mask, with potential for broader applicability and policy adaptations.

Abstract

Simultaneous speech translation (SST) takes streaming speech input and generates text translation on the fly. Existing methods either have high latency due to recomputation of input representations, or fall behind offline ST in translation quality. In this paper, we propose FASST, a fast large language model based method for streaming speech translation. We propose blockwise-causal speech encoding and a consistency mask, so that streaming speech input can be encoded incrementally without recomputation. Furthermore, we develop a two-stage training strategy to optimize FASST for simultaneous inference. We evaluate FASST and multiple strong prior models on the MuST-C dataset. Experimental results show that FASST achieves the best quality-latency trade-off. It outperforms the previous best model by an average of 1.5 BLEU under the same latency for English-to-Spanish translation.
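
To illustrate why blockwise-causal encoding can avoid recomputation, the following is a minimal sketch of one plausible attention mask. The abstract does not spell out the mask, so this assumes each speech frame attends to every frame in its own block and in all earlier blocks; the function name `blockwise_causal_mask` and the `block_size` parameter are hypothetical and not taken from the paper.

```python
def blockwise_causal_mask(num_frames, block_size):
    """Build a boolean attention mask (assumed form, not the paper's code).

    mask[i][j] is True when frame i may attend to frame j, i.e. when
    frame j lies in the same block as frame i or in an earlier block.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        [(j // block_size) <= (i // block_size) for j in range(num_frames)]
        for i in range(num_frames)
    ]


# Mask for 4 frames, then for 6 frames after one more block arrives.
mask_4 = blockwise_causal_mask(4, block_size=2)
mask_6 = blockwise_causal_mask(6, block_size=2)

# Rows of already-encoded frames never attend to the new block, so their
# representations do not depend on it.
old_rows_unchanged = all(
    mask_6[i][:4] == mask_4[i] and not any(mask_6[i][4:]) for i in range(4)
)
```

Under this assumed mask, frames in earlier blocks never see later blocks, so their encoder states can be cached when a new block of speech arrives.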

Paper Structure (35 sections, 13 equations, 9 figures)

This paper contains 35 sections, 13 equations, 9 figures.

Figures (9)

  • Figure 1: Simultaneous speech translation with AlignAtt-0.1B, LST-7B and our FASST-7B. The LST-7B model generates translation with significantly higher latency than AlignAtt, while our FASST-7B achieves latency comparable to AlignAtt.
  • Figure 2: Overview of FASST. (a) shows the offline translation of LLM-based ST model. (b) depicts the 2-stage training pipeline of FASST. Stage 1 aligns adapter output with LLM embedding and stage 2 finetunes for simultaneous translation using wait-$k$-stride-$n$ policy. (c) illustrates the simultaneous inference procedure of FASST with incremental speech encoding and LLM decoding with consistency mask.
  • Figure 3: Example of wait-$1$-stride-$2$. The model waits for $1$ segment at the beginning and then alternates between generating $2$ words (including punctuation) and reading a new segment.
  • Figure 4: Duration distribution of MuST-C-Short and MuST-C-Long. The average duration of MuST-C-Short is around 5 seconds while that of MuST-C-Long is around 25 seconds.
  • Figure 5: Quality-latency trade-off of FASST and baselines on the English-Spanish and English-German directions. Quality is reflected by BLEU and latency is reflected by computation-aware length-adaptive average lagging (LAAL-CA). Given long speech input and large batch size, our model achieves overall the best quality-latency trade-off.
  • ...and 4 more figures
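
The wait-$k$-stride-$n$ policy described in the Figure 3 caption can be sketched as a read/write schedule. This is an illustrative sketch only, not the authors' implementation: the function name `wait_k_stride_n` is hypothetical, and the sketch assumes the number of target words is known in advance and that the remaining words are emitted once the speech input is exhausted, whereas a real system would stop on an end-of-sequence token.

```python
def wait_k_stride_n(num_segments, num_words, k, n):
    """Return the READ/WRITE schedule of a wait-k-stride-n policy.

    Assumptions (for illustration): the target length `num_words` is known
    up front, and once all segments are read the rest is written out.
    """
    actions = []
    read = 0      # number of speech segments consumed so far
    written = 0   # number of target words emitted so far

    # Initial wait: read k segments (or all of them if the input is shorter).
    while read < min(k, num_segments):
        actions.append(("READ", read))
        read += 1

    while written < num_words:
        if read >= num_segments:
            # Source exhausted: emit the remaining words one by one.
            actions.append(("WRITE", written))
            written += 1
            continue
        # Emit up to n words, then read one new segment.
        for _ in range(n):
            if written >= num_words:
                break
            actions.append(("WRITE", written))
            written += 1
        if written < num_words:
            actions.append(("READ", read))
            read += 1
    return actions


# wait-1-stride-2 over 3 speech segments and 5 target words.
schedule = wait_k_stride_n(num_segments=3, num_words=5, k=1, n=2)
```

For this wait-1-stride-2 example, the schedule reads one segment, writes two words, reads the next segment, writes two more words, reads the final segment, and writes the last word.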