Self-Speculative Biased Decoding for Faster Re-Translation
Linxiao Zeng, Haoyun Deng, Kangyuan Shu, Shizhen Wang
TL;DR
The paper tackles the latency of deploying large language models for streaming translation. It introduces Self-Speculative Biased Decoding (SSBD), a tuning-free decoding strategy that reuses the previous translation as a speculative draft, verifies it in a single forward pass, and resumes autoregressive decoding from the first divergence. A biased draft verification mechanism improves draft acceptance without sacrificing eventual corrections, and a display-only mask-k strategy reduces user-visible flicker while keeping the hidden tokens in the draft for internal verification. Empirical results on Flores and ACL 60/60 show that SSBD achieves 1.3–1.7× speedups with translation quality comparable to standard re-translation, requiring no architectural changes or fine-tuning. The method relies on temporal coherence in streaming inputs and offers practical benefits for off-the-shelf LLM-based simultaneous translation; its limitations relate to prefix monotonicity and potential degradation if the bias is mis-tuned.
Abstract
Large language models achieve strong machine translation quality but incur high inference cost and latency, posing challenges for simultaneous translation. Re-translation provides a practical solution for off-the-shelf LLMs by repeatedly regenerating the target output as the source input grows, but it suffers from substantial redundant computation. We propose Self-Speculative Biased Decoding (SSBD), a simple and tuning-free inference method that accelerates re-translation by exploiting temporal coherence in streaming translation. SSBD reuses the model's previous output as a speculative draft for the updated input, verifies the draft efficiently in a single forward pass with a lightweight bias, and resumes autoregressive decoding only from the first divergence. We further introduce a display-only masking strategy that hides unstable suffixes from the user interface while retaining them in the draft for verification and potential acceptance. Experiments show that SSBD achieves substantial speedup over standard re-translation while maintaining comparable translation quality, without architectural changes, auxiliary models, or extra fine-tuning.
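The verify-then-resume step and the display-only mask described above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the authors' implementation: the bias is modeled here as an additive bonus on the draft token's logit before a greedy comparison (the exact form of the bias is not given in this section), the per-position logits are assumed to come from one forward pass over the updated source plus the previous output, and the names `verify_draft`, `display_view`, `bias`, and `k` are hypothetical.

```python
from typing import List, Sequence, Tuple


def verify_draft(
    step_logits: Sequence[Sequence[float]],
    draft: Sequence[int],
    bias: float = 0.0,
) -> Tuple[int, List[int]]:
    """Greedy verification of a draft against per-position logits.

    step_logits[i] holds the model's logits for position i, conditioned on
    draft[:i] (assumed to come from a single forward pass).
    ASSUMPTION: the bias is an additive bonus on the draft token's logit.
    Returns (number of accepted draft tokens, verified prefix). On a
    divergence the prefix ends with the model's own token, and ordinary
    autoregressive decoding would resume from there.
    """
    accepted: List[int] = []
    for logits, draft_tok in zip(step_logits, draft):
        biased = list(logits)
        biased[draft_tok] += bias  # favor the draft token
        best = max(range(len(biased)), key=biased.__getitem__)
        if best != draft_tok:
            # First divergence: later logits were conditioned on a rejected
            # token, so they are discarded and decoding resumes from here.
            return len(accepted), accepted + [best]
        accepted.append(draft_tok)
    return len(accepted), accepted


def display_view(tokens: Sequence[int], k: int) -> List[int]:
    """Hide the last k tokens from the display only; the full sequence is
    still kept as the draft for the next re-translation step."""
    visible = max(0, len(tokens) - k)
    return list(tokens[:visible])


# Toy example with a 3-token vocabulary; the previous output is the draft.
draft = [0, 1, 2]
step_logits = [
    [2.0, 1.0, 0.0],  # model agrees with draft token 0
    [1.0, 0.8, 0.0],  # model narrowly prefers 0 over draft token 1
    [0.0, 0.0, 3.0],  # model agrees with draft token 2
]

no_bias = verify_draft(step_logits, draft, bias=0.0)
with_bias = verify_draft(step_logits, draft, bias=0.5)
shown = display_view(with_bias[1], k=1)
```

In this toy input, the unbiased check rejects the draft at the second position, where the model's preference is marginal, while a small bias lets the whole draft through. A bias that is too large would also accept tokens the model strongly disagrees with, which corresponds to the mis-tuning limitation noted in the TL;DR.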
