Video Editing for Audio-Visual Dubbing

Binyamin Manela, Sharon Gannot, Ethan Fetaya

TL;DR

EdiDub reframes visual dubbing as a content-aware editing problem that preserves original visual context while synchronizing lip movements to new speech. It introduces a two-stage diffusion pipeline (LSD for lip-region editing at $64\times64$ and SRD for full-resolution refinement) guided by accurate visual references, quantized HuBERT speech tokens, and DDIM inversion to maintain fidelity. The method demonstrates superior lip-sync accuracy, identity preservation, and visual naturalness across challenging datasets, including occluded-lip scenarios, and is validated by both quantitative metrics and human MOS evaluations. The approach offers a robust alternative to generation or inpainting, enabling faithful, context-aware dubbing with practical applicability to global media distribution, albeit with current computational efficiency limitations.
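
The DDIM inversion mentioned in the summary can be illustrated on a toy signal. The sketch below is not EdiDub's implementation: it assumes a deterministic DDIM update (eta = 0), a linear beta schedule, and a hypothetical stand-in noise predictor `toy_eps` in place of a trained network. All function names and parameter values are illustrative assumptions.

```python
import math

def make_alpha_bars(num_steps=50, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def ddim_move(x, eps, a_from, a_to):
    """Deterministic DDIM update (eta = 0) from noise level a_from to a_to."""
    out = []
    for xi, ei in zip(x, eps):
        # Estimate the clean value implied by the current sample and predicted noise.
        x0_pred = (xi - math.sqrt(1.0 - a_from) * ei) / math.sqrt(a_from)
        out.append(math.sqrt(a_to) * x0_pred + math.sqrt(1.0 - a_to) * ei)
    return out

def ddim_invert(x0, eps_model, alpha_bars):
    """Map a clean signal to a noisy latent by applying the update toward more noise."""
    levels = [1.0] + list(alpha_bars)  # level 1.0 corresponds to the clean input
    x = list(x0)
    for t in range(len(alpha_bars)):
        x = ddim_move(x, eps_model(x, t), levels[t], levels[t + 1])
    return x

def ddim_sample(x_T, eps_model, alpha_bars):
    """Run the same update in the opposite direction to recover the signal."""
    levels = [1.0] + list(alpha_bars)
    x = list(x_T)
    for t in reversed(range(len(alpha_bars))):
        x = ddim_move(x, eps_model(x, t), levels[t + 1], levels[t])
    return x

def toy_eps(x, t):
    """Hypothetical noise predictor; a real system would call a trained network."""
    return [0.1 * math.tanh(v) for v in x]

pixels = [-1.0, -0.5, 0.0, 0.5, 1.0]  # made-up values standing in for image pixels
alpha_bars = make_alpha_bars()
latent = ddim_invert(pixels, toy_eps, alpha_bars)
reconstruction = ddim_sample(latent, toy_eps, alpha_bars)
```

Because the predictor is evaluated at slightly different points during inversion and sampling, `reconstruction` matches `pixels` only approximately. The round trip is exact when the predicted noise does not depend on `x`. This near-invertibility is what lets an editing pipeline start from a latent that still encodes the original frame.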

Abstract

Visual dubbing, the synchronization of facial movements with new speech, is crucial for making content accessible across different languages, enabling broader global reach. However, current methods face significant limitations. Existing approaches often generate talking faces, hindering seamless integration into original scenes, or employ inpainting techniques that discard vital visual information like partial occlusions and lighting variations. This work introduces EdiDub, a novel framework that reformulates visual dubbing as a content-aware editing task. EdiDub preserves the original video context by utilizing a specialized conditioning scheme to ensure faithful and accurate modifications rather than mere copying. On multiple benchmarks, including a challenging occluded-lip dataset, EdiDub significantly improves identity preservation and synchronization. Human evaluations further confirm its superiority, achieving higher synchronization and visual naturalness scores compared to the leading methods. These results demonstrate that our content-aware editing approach outperforms traditional generation or inpainting, particularly in maintaining complex visual elements while ensuring accurate lip synchronization.

Paper Structure

This paper contains 42 sections, 5 equations, 4 figures, 6 tables, 2 algorithms.

Figures (4)

  • Figure 1: Qualitative comparison across time. Each row presents 7 consecutive frames from a different method, with the original video at the top and our model at the bottom. Existing methods create severe artifacts on these examples where the mouth is occluded.
  • Figure 2: Overview of our dubbing system EdiDub. (a) Model architectures for both stages of our pipeline. (b) End-to-end inference process for lip-synced facial video generation.
  • Figure 3: MOS evaluation interface presented to participants. Each task consists of six randomized videos that participants rated for naturalness and audio-visual synchronization.
  • Figure 4: Example frame from a corrupted video used for attention screening. The mouth region exhibits visible inconsistencies and noise artifacts.
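
The MOS procedure that Figures 3 and 4 describe can be sketched as a small aggregation step. This is only an illustration under stated assumptions: the 1 to 5 rating scale, the screening threshold, the key name `corrupted`, and all ratings below are made up for the example and are not taken from the paper.

```python
from statistics import mean

def screen_participants(ratings, control_key="corrupted", max_control_score=2):
    """Keep participants who gave the corrupted control video a low score.

    The threshold is an assumed value, not one reported by the authors.
    """
    return {
        participant: scores
        for participant, scores in ratings.items()
        if scores[control_key] <= max_control_score
    }

def mean_opinion_scores(ratings, control_key="corrupted"):
    """Average the remaining ratings per method, ignoring the control video."""
    methods = sorted(
        {m for scores in ratings.values() for m in scores if m != control_key}
    )
    return {
        m: mean(scores[m] for scores in ratings.values() if m in scores)
        for m in methods
    }

# Made-up ratings on an assumed 1-5 scale; method names are placeholders.
toy_ratings = {
    "p1": {"method_a": 4, "method_b": 2, "corrupted": 1},
    "p2": {"method_a": 5, "method_b": 3, "corrupted": 1},
    "p3": {"method_a": 1, "method_b": 5, "corrupted": 5},  # fails the screen
}

kept = screen_participants(toy_ratings)
mos = mean_opinion_scores(kept)
```

In a study with separate naturalness and synchronization ratings, the same aggregation would be run once per criterion.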