Text-based Talking Video Editing with Cascaded Conditional Diffusion

Bo Han, Heqing Zou, Haoyang Li, Guangcong Wang, Chng Eng Siong

TL;DR

The paper tackles text-based talking-head video editing by proposing a cascaded diffusion framework that decomposes editing into two stages: audio-to-dense-landmark motion and motion-to-video rendering. Stage 1 uses a dynamic weighted in-context diffusion model to translate edited audio into dense-landmark motions encoded as identity-aware landmark images, with a loss $L_{obj}$ guiding the diffusion process. Stage 2 employs a warping-guided diffusion model that first performs ID-preserving interpolation-warping to generate coarse frames and then refines them conditioned on warped intermediates and audio, ensuring smooth, coherent, and identity-preserving video synthesis. On the HDTF dataset, the method achieves superior image quality and audio-visual consistency compared to state-of-the-art baselines, while reducing dependence on large per-identity training data, enabling effective sentence-level editing in a zero-shot setting.

Abstract

Text-based talking-head video editing aims to efficiently insert, delete, and substitute segments of talking videos through a user-friendly text editing approach. It is challenging because it requires 1) a generalizable talking-face representation, 2) seamless audio-visual transitions, and 3) identity-preserved talking faces. Previous works either require minutes of talking-face video training data and expensive test-time optimization for customized talking video editing, or directly generate a video sequence without considering in-context information, leading to poorly generalizable representations, incoherent transitions, or even inconsistent identity. In this paper, we propose an efficient cascaded conditional diffusion-based framework, which consists of two stages: audio to dense-landmark motion and motion to video. In the first stage, we propose a dynamic weighted in-context diffusion module to synthesize dense-landmark motions given the edited audio. In the second stage, we introduce a warping-guided conditional diffusion module. The module first interpolates between the start and end frames of the editing interval to generate smooth intermediate frames. Then, with the help of the audio-to-dense motion images, these intermediate frames are warped to obtain coarse intermediate frames. Conditioned on the warped intermediate frames, a diffusion model is adopted to generate detailed and high-resolution target frames, which guarantees coherent and identity-preserved transitions. The cascaded conditional diffusion model decomposes the complex talking editing task into two flexible generation tasks, which provides a generalizable talking-face representation, seamless audio-visual transitions, and identity-preserved faces on a small dataset. Experiments show the effectiveness and superiority of the proposed method.
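
The interpolation step in the second stage can be illustrated with a minimal sketch. The abstract does not say which interpolation scheme is used, so the linear cross-fade below is an assumption chosen for simplicity. The function name interpolate_frames and its parameters are hypothetical, and the warping and diffusion refinement that follow in the paper are not shown.

```python
import numpy as np


def interpolate_frames(start, end, num_intermediate):
    """Linearly blend from `start` to `end`; return only the in-between frames.

    Assumption: a plain linear cross-fade stands in for the paper's
    interpolation step, whose exact form the abstract does not specify.
    """
    if start.shape != end.shape:
        raise ValueError("start and end frames must have the same shape")
    if num_intermediate < 1:
        raise ValueError("num_intermediate must be at least 1")

    # Blend weights strictly inside (0, 1): the two endpoints are the real
    # frames bounding the editing interval, so they are not regenerated.
    alphas = np.linspace(0.0, 1.0, num_intermediate + 2)[1:-1]

    # Reshape the weights so they broadcast over every frame axis (H, W, C).
    alphas = alphas.reshape((-1,) + (1,) * start.ndim).astype(np.float32)

    start_f = start.astype(np.float32)
    end_f = end.astype(np.float32)
    return (1.0 - alphas) * start_f + alphas * end_f


# Toy 4x4 RGB frames standing in for the frames at the interval boundaries.
start = np.zeros((4, 4, 3), dtype=np.uint8)
end = np.full((4, 4, 3), 200, dtype=np.uint8)
frames = interpolate_frames(start, end, 3)
print(frames.shape)  # (3, 4, 4, 3)
```

A cross-fade of this kind only yields smooth but blurry placeholders. In the pipeline the abstract describes, such frames would then be warped using the landmark motion images and refined by the diffusion model.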

Paper Structure (20 sections, 3 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: We implement talking-head video editing tasks in zero-shot scenarios, without the need for finetuning on specific character data. Frames marked in green are generated by our method. Our approach is not limited to word-level editing but also facilitates sentence-level editing.
  • Figure 2: Overview of the proposed method. CA represents the cross-attention mechanism.
  • Figure 3: Visualized results of ablation study.