Contrastive Feedback Mechanism for Simultaneous Speech Translation
Haotian Tan, Sakriani Sakti
TL;DR
The paper addresses the quality-latency trade-off in simultaneous speech translation (SST) and notes that unstable predictions from early chunks are typically ignored by existing decision policies. It introduces the Contrastive Feedback Mechanism (CFM), which uses these unstable predictions as feedback to refine the translation of subsequent chunks via a contrastive objective, formalized as $\text{CFM-Score}(P_c;P_f) = \log P_c + \mathrm{Contrast}(P_c;P_f)$ with $\mathrm{Contrast}(P_c;P_f) = \log \frac{p_c(y_i|y_{<i})}{P_f}$ and a plausibility constraint $\mathcal{V}_\beta$ ($\beta=0.1$). The method integrates with existing policies (AlignAtt, EDAtt, LA) by deriving the feedback distribution $P_f$ from unstable predictions and rescoring current-chunk candidates accordingly. Empirical evaluation on MuST-C v1.0 across eight languages shows that CFM consistently improves BLEU with negligible latency increases, with gains of up to 2.05 BLEU points (notably for LA on en→nl), demonstrating the practicality of leveraging unstable predictions for real-time translation.
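The rescoring step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `cfm_scores` and `pick_token`, the `eps` guard, and the toy distributions are hypothetical. The summary does not spell out how $\mathcal{V}_\beta$ is built, so the sketch assumes the form commonly used in contrastive decoding, keeping only tokens whose current-chunk probability is at least $\beta$ times the largest current-chunk probability.

```python
import math


def cfm_scores(p_c, p_f, beta=0.1, eps=1e-12):
    """Rescore next-token candidates with a CFM-style contrastive score.

    p_c: next-token probabilities from the current chunk (one per vocab entry).
    p_f: feedback probabilities derived from unstable predictions.
    Assumption: V_beta keeps tokens with p_c >= beta * max(p_c).
    """
    if len(p_c) != len(p_f):
        raise ValueError("p_c and p_f must cover the same vocabulary")

    threshold = beta * max(p_c)
    scores = []
    for pc, pf in zip(p_c, p_f):
        # Tokens outside the plausible set can never be selected.
        if pc <= 0.0 or pc < threshold:
            scores.append(-math.inf)
            continue
        # Contrast(P_c; P_f) = log(p_c / p_f); eps guards against log(0).
        contrast = math.log(pc) - math.log(max(pf, eps))
        # CFM-Score = log p_c + Contrast(P_c; P_f)
        scores.append(math.log(pc) + contrast)
    return scores


def pick_token(p_c, p_f, beta=0.1):
    """Return the index of the candidate with the highest CFM-style score."""
    scores = cfm_scores(p_c, p_f, beta)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy example: the feedback distribution strongly favors token 0,
# so the contrast term shifts the choice toward token 1.
current = [0.6, 0.3, 0.05, 0.05]
feedback = [0.7, 0.1, 0.1, 0.1]
best = pick_token(current, feedback)
```

In this toy case, greedy decoding on `current` alone would pick token 0, while the contrastive score prefers token 1, because the feedback assigns token 0 an even higher probability than the current chunk does. Tokens 2 and 3 fall below the assumed plausibility threshold and are excluded regardless of their contrast.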
Abstract
Recent advances in simultaneous speech translation (SST) focus on the decision policies that enable the use of offline-trained ST models for simultaneous inference. These decision policies not only control the quality-latency trade-off in SST but also mitigate the impact of unstable predictions on translation quality by delaying translation for more context or discarding these predictions through stable hypothesis detection. However, these policies often overlook the potential benefits of utilizing unstable predictions. We introduce the contrastive feedback mechanism (CFM) for SST, a novel method that leverages these unstable predictions as feedback to improve translation quality. CFM guides the system to eliminate undesired model behaviors from these predictions through a contrastive objective. The experiments on 3 state-of-the-art decision policies across 8 languages in the MuST-C v1.0 dataset show that CFM effectively improves the performance of SST.
