RASA: Replace Anyone, Say Anything -- A Training-Free Framework for Audio-Driven and Universal Portrait Video Editing

Tianrui Pan, Lin Liu, Jie Liu, Xiaopeng Zhang, Jie Tang, Gangshan Wu, Qi Tian

TL;DR

RASA tackles training-free universal portrait video editing by leveraging inversion latents and DDIM-based perturbations guided by appearance from the first frame and speech inputs. It introduces Unified Animation Control (UAC) to separately control shape, audio-driven lip motion, and temporal coherence, enabling appearance editing, lip editing, or their combination. The approach demonstrates superior lip synchronization and flexible motion transfer for appearance editing, while accommodating head rotations and expression changes via first-frame processing. Evaluations on the HDTF dataset show consistent improvements over state-of-the-art baselines in lip editing quality and appearance editing coherence, highlighting its practical potential for rapid, training-free portrait video manipulation across identities and speech content.

Abstract

Portrait video editing modifies specific attributes of a portrait video under the guidance of an audio or video stream. Previous methods typically either concentrate on lip-region reenactment or require training specialized models to extract keypoints for motion transfer to a new identity. In this paper, we introduce a training-free, universal portrait video editing framework. It supports portrait appearance editing conditioned on an edited first reference frame, lip editing conditioned on new speech, or a combination of both. The framework is built on a Unified Animation Control (UAC) mechanism that uses inversion latents of the source video to edit the entire portrait, and it comprises visual-driven shape control, audio-driven speaking control, and inter-frame temporal control. The method can also be adapted to different scenarios by adjusting the initial reference frame, which enables detailed editing of portrait videos with specific head rotations and facial expressions. Experimental results show that our model achieves more accurate and better-synchronized lip movements on the lip editing task, and more flexible motion transfer on the appearance editing task. Demo is available at https://alice01010101.github.io/RASA/.
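
As a rough illustration of what "inversion latents" refers to, the sketch below shows generic deterministic DDIM inversion and sampling on a toy latent. It is not the RASA implementation: the function names, the noise schedule `alphas`, and the toy noise predictor `toy_eps` are all assumptions made for this example, and a real system would use a pretrained video diffusion model as the noise predictor.

```python
import numpy as np

def ddim_step(x, eps, a_from, a_to):
    """Deterministic DDIM update from cumulative alpha a_from to a_to."""
    # Estimate the clean latent implied by the current latent and noise.
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    # Re-noise that estimate to the target noise level with the same eps.
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

def ddim_invert(x0, eps_model, alphas):
    """Walk a clean latent toward noise; alphas runs from near 1 down to near 0."""
    x = x0
    for i in range(len(alphas) - 1):
        x = ddim_step(x, eps_model(x, i), alphas[i], alphas[i + 1])
    return x

def ddim_sample(x_T, eps_model, alphas):
    """Walk a noisy latent back toward the clean end of the schedule."""
    x = x_T
    for i in range(len(alphas) - 1, 0, -1):
        x = ddim_step(x, eps_model(x, i), alphas[i], alphas[i - 1])
    return x

# Hypothetical stand-in for a learned noise predictor (assumption).
toy_eps = lambda x, t: 0.1 * np.tanh(x)

alphas = np.linspace(0.999, 0.01, 50)      # assumed toy schedule
source_latent = np.array([0.5, -1.0, 2.0, 0.0])
inverted = ddim_invert(source_latent, toy_eps, alphas)
reconstructed = ddim_sample(inverted, toy_eps, alphas)
print("round-trip error:", np.abs(reconstructed - source_latent).max())
```

With a learned noise predictor the round trip is only approximate, because inversion evaluates the predictor at a slightly different point than sampling does. Editing methods built on inversion typically start from the inverted latent and change the conditioning during sampling, which is the general idea the abstract describes for the reference frame and speech inputs.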

Paper Structure

This paper contains 16 sections, 3 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: The overall pipeline of our model. RASA enables training-free portrait video editing by leveraging inversion latents of the source video, guided by any portrait identity or speech.
  • Figure 2: The overall structure of RASA. The left part (a) illustrates the tuning-free universal portrait video editing pipeline. The right part (b) details the denoising steps with unified animation control, outlining the process for achieving multi-task portrait video editing.
  • Figure 3: Qualitative comparisons for portrait lip editing. We selected two edited video samples from the HDTF dataset to compare our method with others. In part (a), our audio-driven approach shows more natural lip movements and better synchronization. Part (b) highlights the robustness of our results in various scenarios, including silent speech and head rotations.
  • Figure 4: Qualitative comparisons for portrait appearance editing. We first apply background inpainting to obtain an edited first frame, using the target portrait as the foreground and the source background. We then feed this edited frame to our model as the reference image to perform portrait video appearance editing.
  • Figure 5: Portrait appearance editing with different views. From left to right, we sequentially edit the first reference image by changing the background, applying head rotations, and expression editing.
  • ...and 2 more figures