DiffDub: Person-generic Visual Dubbing Using Inpainting Renderer with Diffusion Auto-encoder

Tao Liu, Chenpeng Du, Shuai Fan, Feilong Chen, Kai Yu

TL;DR

DiffDub addresses person-generic visual dubbing with a two-stage diffusion framework that decouples rendering from lip synchronization. The first stage is an inpainting renderer built on a diffusion auto-encoder, guided by a semantic latent $z_{sem}$ and a mask that marks the editable region. The second stage is a Conformer-based video sequence generator that uses cross-attention to fuse multiple reference textures with audio latent codes $a$, predicting the latent codes $z'_{1:T}$ from which the frames are rendered. The approach uses DDIM for efficient inference, and data augmentation with supplementary eye guidance for stability, and it needs no per-speaker fine-tuning. Quantitative and qualitative evaluations on the HDTF dataset, together with ablations and user studies, indicate high visual quality and seamless, intelligible videos across speakers and languages.

Abstract

Generating high-quality, person-generic visual dubbing remains a challenge. Recent work has introduced a two-stage paradigm that decouples rendering from lip synchronization, with an intermediate representation serving as the conduit between them. Still, previous methods rely on rough landmarks or are confined to a single speaker, which limits their performance. In this paper, we propose DiffDub: Diffusion-based dubbing. We first build the inpainting renderer on a diffusion auto-encoder, incorporating a mask that separates the editable zone from the regions left unaltered. This allows seamless filling of the lower-face region while preserving the rest of the frame. During our experiments, we encountered several challenges. First, the semantic encoder lacked robustness, restricting its ability to capture high-level features. Second, the modeling ignored facial positioning, causing the mouth or nose to jitter across frames. To tackle these issues, we employ several strategies, including data augmentation and supplementary eye guidance. Moreover, we introduce a conformer-based reference encoder and motion generator strengthened by a cross-attention mechanism. This enables our model to learn person-specific textures from varying references and reduces reliance on paired audio-visual data. Our experiments show that our approach outperforms existing methods by a considerable margin and delivers seamless, intelligible videos in person-generic and multilingual scenarios.
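
The masking idea in the abstract (fill the lower face, leave everything else untouched) can be illustrated with a small sketch. This is not the paper's renderer: the half-frame mask, the per-pixel blend, and the names `lower_face_mask` and `composite` are assumptions made for illustration, and the "generated" frame is a constant stand-in for whatever the diffusion model would produce.

```python
def lower_face_mask(height, width, top_fraction=0.5):
    """Return a height x width grid: 1.0 marks editable pixels, 0.0 preserved ones.

    Assumption: the editable zone is simply everything below `top_fraction`
    of the frame height. The real mask shape is not specified here.
    """
    first_editable_row = int(height * top_fraction)
    return [
        [1.0 if row >= first_editable_row else 0.0 for _ in range(width)]
        for row in range(height)
    ]


def composite(original, generated, mask):
    """Blend per pixel: use `generated` where mask is 1, `original` where it is 0."""
    return [
        [m * g + (1.0 - m) * o for o, g, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]


# Toy 4x4 grayscale frame and a constant stand-in for the renderer's output.
original = [[0.1 * (row + 1)] * 4 for row in range(4)]
generated = [[0.9] * 4 for _ in range(4)]

mask = lower_face_mask(4, 4)
result = composite(original, generated, mask)
```

In this toy case the top two rows of `result` are copied from `original` and the bottom two rows come from `generated`, which is the "editable zone versus unaltered region" split the abstract describes.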

Paper Structure (9 sections, 2 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Dubbed videos with audio in various languages. Our method can produce seamless and intelligible videos.
  • Figure 2: Architecture of DiffDub. DiffDub follows a two-stage paradigm: inpainting rendering with a Diffusion Auto-encoder, followed by video sequence generation. In the first stage, we introduce a Diffusion Auto-encoder with masked conditions, whose semantic encoder produces the semantic latent codes $z$. In the video generation stage, the semantic latent code $z$, together with the audio latent code $a$ derived from an existing model, is used to generate the final videos.
  • Figure 3: Qualitative results on Reconstruction & Dubbing. The corresponding pronounced syllables are highlighted in red.
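
The second stage in Figure 2 relies on cross-attention to combine reference information with audio. The sketch below shows plain scaled dot-product cross-attention in pure Python. It is a generic illustration, not the paper's architecture: treating audio-driven features as queries and reference latents as keys and values is an assumption, the learned projection matrices and the Conformer layers are omitted, and all names and numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys.

    Returns one output vector per query, each a weighted average of `values`.
    No learned projections are applied (a simplification for illustration).
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


# Hypothetical inputs: 2 audio-driven query vectors (one per frame)
# and 3 reference latent vectors acting as both keys and values.
audio_queries = [[1.0, 0.0], [0.0, 1.0]]
reference_latents = [[0.2, 0.8], [0.6, 0.4], [-0.5, 0.1]]

fused = cross_attention(audio_queries, reference_latents, reference_latents)
```

Because the attention weights are computed over however many keys are supplied, the same function works for one reference or many, which is in the spirit of the abstract's "varying references".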