Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech Translation
Lucas Goncalves, Prashant Mathur, Xing Niu, Brady Houston, Chandrashekhar Lavania, Srikanth Vishnubhotla, Lijia Sun, Anthony Ferritto
TL;DR
The paper tackles lip-synchrony in direct audio-visual speech-to-speech translation (AVS2S) by adding a lip-synchrony loss to training alongside a duration predictor, so that translated speech aligns with the original lip movements without modifying the visual content. The system combines an AV-HuBERT-based audio-visual encoder, a unit-to-unit translator, and a vocoder; the translated speech is overlaid on the original video. Training optimizes a total loss $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{sync}} + \lambda \mathcal{L}_{\text{dur}}$, with $\mathcal{L}_{\text{dur}} = \frac{1}{N} \sum_{i=1}^{N} (\log d^{p}_i - \log d_i)^2$ and $\lambda = 10$. Evaluated on LRS3 across four language pairs, the approach achieves an average LSE-D of $10.67$ (lower is better), a $9.2\%$ reduction relative to a strong baseline, while preserving speech naturalness and translation quality as measured by PESQ, BLASER, and ASR-BLEU. The results indicate that lip-synchrony constraints can be integrated into AVS2S to produce realistic dubbed content without face edits or the artifacts they introduce. Ablations show the importance of combining the lip-sync and duration losses, and point to potential trade-offs when exploring paraphrase-based translation options.
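As a rough illustration of the loss arithmetic above, the following sketch computes the log-domain duration loss and the weighted total in plain Python. It assumes that $d^{p}_i$ and $d_i$ are predicted and reference durations (the TL;DR does not define them), that all durations are strictly positive, and that the lip-synchrony loss arrives as an already-computed scalar. The function names `duration_loss` and `total_loss` are hypothetical and are not taken from the paper's code.

```python
import math


def duration_loss(pred_durations, target_durations):
    """Mean squared error between log-durations.

    Assumption: pred_durations plays the role of d^p_i and
    target_durations the role of d_i; both must be strictly positive.
    """
    if len(pred_durations) != len(target_durations):
        raise ValueError("duration sequences must have the same length")
    if not pred_durations:
        raise ValueError("duration sequences must be non-empty")
    n = len(pred_durations)
    squared_errors = (
        (math.log(p) - math.log(t)) ** 2
        for p, t in zip(pred_durations, target_durations)
    )
    return sum(squared_errors) / n


def total_loss(sync_loss, dur_loss, lam=10.0):
    """L_total = L_sync + lambda * L_dur, with lambda = 10 as in the TL;DR."""
    return sync_loss + lam * dur_loss


if __name__ == "__main__":
    # Hypothetical durations (e.g. frames per unit), for illustration only.
    predicted = [2.0, 3.0, 5.0]
    reference = [2.0, 4.0, 5.0]
    l_dur = duration_loss(predicted, reference)
    # The sync loss value here is a placeholder scalar, not a real measurement.
    print(total_loss(sync_loss=0.5, dur_loss=l_dur))
```

Working in the log domain means the penalty depends on the ratio between predicted and reference durations rather than on their absolute difference, so a 3-versus-4 mismatch costs the same as 30-versus-40.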
Abstract
Audio-Visual Speech-to-Speech Translation (AVS2S) typically prioritizes improving translation quality and naturalness. However, an equally critical aspect of audio-visual content is lip-synchrony, that is, ensuring that the movements of the lips match the spoken content, which is essential for maintaining realism in dubbed videos. Despite its importance, the inclusion of lip-synchrony constraints in AVS2S models has been largely overlooked. This study addresses this gap by integrating a lip-synchrony loss into the training process of AVS2S models. Our proposed method significantly enhances lip-synchrony in direct audio-visual speech-to-speech translation, achieving an average LSE-D score of 10.67, a 9.2% reduction in LSE-D over a strong baseline across four language pairs. Additionally, it maintains the naturalness and high quality of the translated speech when overlaid onto the original video, without any degradation in translation quality.
