Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech Translation

Lucas Goncalves, Prashant Mathur, Xing Niu, Brady Houston, Chandrashekhar Lavania, Srikanth Vishnubhotla, Lijia Sun, Anthony Ferritto

TL;DR

The paper tackles lip-synchrony in direct audio-visual speech-to-speech translation (AVS2S) by integrating a lip-synchrony loss into training alongside a duration predictor, so that translated speech aligns with the original lip movements without modifying the visual content. Using an AV-HuBERT-based AV encoder, a unit-to-unit translator, and a vocoder, the method overlays translated speech on the original video and optimizes a total loss $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{sync}} + \lambda \mathcal{L}_{\text{dur}}$, with $\mathcal{L}_{\text{dur}} = \frac{1}{N} \sum_{i=1}^{N} (\log d^{p}_i - \log d_i)^2$ and $\lambda = 10$. Evaluated on LRS3 across four language pairs, the approach achieves an average LSE-D of $10.67$, a $9.2\%$ reduction in LSE-D over a strong baseline, while preserving speech naturalness and translation quality as measured by PESQ, BLASER, and ASR-BLEU. The results show that lip-synchrony constraints can be integrated into AVS2S to produce realistic dubbed content without face edits or the artifacts they introduce. Ablations show the importance of combining the lip-sync and duration losses, and highlight potential trade-offs when exploring paraphrase-based translation options.
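To make the loss in the TL;DR concrete, the following is a minimal Python sketch of the duration term and the weighted sum. It is an illustration under stated assumptions, not the authors' implementation: the function names `duration_loss` and `total_loss` are hypothetical, $d^{p}_i$ is read as the predicted duration and $d_i$ as the reference duration of unit $i$, and the lip-synchrony loss is passed in as an already computed scalar because the summary does not say how it is calculated.

```python
import math


def duration_loss(predicted, target):
    """Mean squared error between log-durations (the L_dur term).

    Assumption: `predicted` holds d^p_i and `target` holds d_i,
    both strictly positive, one entry per unit.
    """
    if len(predicted) != len(target) or len(target) == 0:
        raise ValueError("predicted and target must be non-empty and equal in length")
    squared_errors = [
        (math.log(p) - math.log(t)) ** 2 for p, t in zip(predicted, target)
    ]
    return sum(squared_errors) / len(target)


def total_loss(sync_loss, dur_loss, lam=10.0):
    """L_total = L_sync + lambda * L_dur, with lambda = 10 as in the summary.

    Assumption: `sync_loss` is a placeholder scalar standing in for the
    lip-synchrony loss, whose computation is not described here.
    """
    return sync_loss + lam * dur_loss


if __name__ == "__main__":
    # Made-up durations (in frames) purely for illustration.
    l_dur = duration_loss([3.0, 5.0, 2.0], [4.0, 5.0, 2.0])
    print(total_loss(sync_loss=1.0, dur_loss=l_dur))
```

Working in the log domain means the penalty depends on the ratio between predicted and reference durations rather than on their absolute difference, so an error of one frame on a short unit costs more than the same error on a long one.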

Abstract

Audio-Visual Speech-to-Speech Translation typically prioritizes improving translation quality and naturalness. However, an equally critical aspect in audio-visual content is lip-synchrony (ensuring that the movements of the lips match the spoken content), which is essential for maintaining realism in dubbed videos. Despite its importance, the inclusion of lip-synchrony constraints in AVS2S models has been largely overlooked. This study addresses this gap by integrating a lip-synchrony loss into the training process of AVS2S models. Our proposed method significantly enhances lip-synchrony in direct audio-visual speech-to-speech translation, achieving an average LSE-D score of 10.67, representing a 9.2% reduction in LSE-D over a strong baseline across four language pairs. Additionally, it maintains the naturalness and high quality of the translated speech when overlaid onto the original video, without any degradation in translation quality.

Paper Structure

This paper contains 15 sections, 3 equations, 2 figures, 6 tables.

Figures (2)

  • Figure 1: AVS2S Framework Overview
  • Figure 2: Prompt for generating paraphrases using Claude 3.0 Sonnet.