Identity-Preserving Video Dubbing Using Motion Warping

Runzhen Liu, Qinjie Lin, Yunfei Liu, Lijian Lin, Ye Zhu, Yu Li, Chuhua Xian, Fa-Ting Hong

TL;DR

IPTalker tackles the challenge of identity-preserving video dubbing by introducing a transformer-based Audio-Visual Alignment Unit (AVAU) that selects reference-mouth appearances aligned with driving audio. It then applies a motion-warping stage to deform the reference images while preserving texture, followed by an inpainting step to resolve occlusions, all trained with a multi-term loss including perceptual, adversarial, and lip-sync components. Across HDTF and VFHQ datasets, IPTalker achieves state-of-the-art realism, temporal coherence, and strong identity retention, outperforming existing methods in lip-sync accuracy and texture fidelity. The approach offers a robust solution for high-quality, identity-consistent dubbing with potential applications in AR/VR, video conferencing, and digital media production.

Abstract

Video dubbing aims to synthesize realistic, lip-synced videos from a reference video and a driving audio signal. Although existing methods can accurately generate mouth shapes driven by audio, they often fail to preserve identity-specific features, largely because they do not effectively capture the nuanced interplay between audio cues and the visual attributes of the reference identity. As a result, the generated outputs frequently lack fidelity in reproducing the unique textural and structural details of the reference identity. To address these limitations, we propose IPTalker, a novel and robust framework for video dubbing that achieves seamless alignment between driving audio and reference identity while ensuring both lip-sync accuracy and high-fidelity identity preservation. At the core of IPTalker is a transformer-based alignment mechanism designed to dynamically capture and model the correspondence between audio features and reference images, thereby enabling precise, identity-aware audio-visual integration. Building on this alignment, a motion warping strategy further refines the results by spatially deforming reference images to match the target audio-driven configuration. A dedicated refinement process then mitigates occlusion artifacts and enhances the preservation of fine-grained textures, such as mouth details and skin features. Extensive qualitative and quantitative evaluations demonstrate that IPTalker consistently outperforms existing approaches in terms of realism, lip synchronization, and identity retention, establishing a new state of the art for high-quality, identity-consistent video dubbing.
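
To make the phrase "spatially deforming reference images" concrete, the following is a minimal sketch of flow-based warping, not the authors' implementation. It assumes a dense per-pixel flow of `(dx, dy)` offsets and the backward-warping convention (each output pixel looks up a source location in the reference). The names `bilinear_sample` and `warp` are hypothetical, and the flow is supplied by hand here, whereas in the paper it is predicted by a network.

```python
from typing import List, Tuple

Image = List[List[float]]               # grayscale image as rows of pixel values
Flow = List[List[Tuple[float, float]]]  # assumed format: per-pixel (dx, dy) offsets


def bilinear_sample(img: Image, x: float, y: float) -> float:
    """Sample img at a fractional (x, y), clamping coordinates to the border."""
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def warp(reference: Image, flow: Flow) -> Image:
    """Backward warp: output[y][x] is the reference sampled at (x + dx, y + dy)."""
    h, w = len(reference), len(reference[0])
    return [
        [
            bilinear_sample(reference, x + flow[y][x][0], y + flow[y][x][1])
            for x in range(w)
        ]
        for y in range(h)
    ]


if __name__ == "__main__":
    ref = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
    shift_right = [[(1.0, 0.0)] * 3 for _ in range(3)]
    for row in warp(ref, shift_right):
        print(row)
```

Because warping only resamples existing reference pixels, it keeps the reference texture, but regions with no valid source (occlusions) cannot be filled this way, which is why a separate refinement or inpainting step follows.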

Paper Structure

This paper contains 16 sections, 13 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Our proposed IPTalker generates dubbed videos using a few reference images to provide identity-specific priors and driving audio to dictate mouth shapes. The synthesis is achieved by inpainting the masked source image, ensuring rich textural details and temporal consistency.
  • Figure 2: The framework of our method consists of three components: (a) Alignment Module, where reference mouth images and driving audio are input into encoders to extract embeddings. The Audio-Visual Alignment Unit (AVAU) captures the relationships among all embeddings to obtain an identity-audio correspondence embedding. (b) Warping Module, which uses the reference image and the identity-audio correspondence embedding to generate a motion flow that deforms the reference image to match the target configuration dictated by the audio. (c) Inpainting Module, which inpaints the masked source image to produce the final generated image.
  • Figure 3: Illustration of the Audio-Visual Alignment Unit (AVAU) and the cross-modalities encoder $\mathcal{E}_{cm}$. The AVAU captures the intricate interplay between audio and visual embeddings through self-attention and cross-attention mechanisms (see a). The cross-modalities encoder $\mathcal{E}_{cm}$ compresses the output of multiple AVAUs using 1D convolution and average pooling (see b).
  • Figure 4: We obtain a precise mask by computing the convex hull of the set of lower-half facial landmarks.
  • Figure 5: We paste the generated face onto the original frame using a Gaussian-smoothed mask to eliminate artifacts around the facial region.
  • ...and 3 more figures
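
The masking and paste-back steps described in the captions of Figures 4 and 5 can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the paper's code: landmarks are assumed to be integer `(x, y)` pixel coordinates with at least three non-collinear points, images are small grayscale grids, and all function names (`convex_hull`, `hull_mask`, `blur`, `blend`) and the `sigma` value are hypothetical.

```python
import math


def cross(o, a, b):
    """2D cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise (x, y) order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def hull_mask(points, height, width):
    """Binary mask: 1.0 for pixels inside or on the convex hull, else 0.0."""
    hull = convex_hull(points)
    n = len(hull)

    def inside(x, y):
        return all(cross(hull[i], hull[(i + 1) % n], (x, y)) >= 0 for i in range(n))

    return [[1.0 if inside(x, y) else 0.0 for x in range(width)] for y in range(height)]


def blur(mask, sigma=1.0):
    """Separable Gaussian blur with border clamping, used to feather the mask."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    k = [w / total for w in weights]
    h, w = len(mask), len(mask[0])
    offsets = range(-radius, radius + 1)
    rows = [[sum(k[j + radius] * mask[y][min(max(x + j, 0), w - 1)] for j in offsets)
             for x in range(w)] for y in range(h)]
    return [[sum(k[j + radius] * rows[min(max(y + j, 0), h - 1)][x] for j in offsets)
             for x in range(w)] for y in range(h)]


def blend(generated, original, soft_mask):
    """Per-pixel mix: soft_mask * generated + (1 - soft_mask) * original."""
    return [[m * g + (1 - m) * o for g, o, m in zip(g_row, o_row, m_row)]
            for g_row, o_row, m_row in zip(generated, original, soft_mask)]
```

The hard hull mask marks the region to be synthesized, while the blurred copy gives a gradual transition at the boundary, so the pasted face fades into the original frame instead of leaving a visible seam.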