FREE-Edit: Using Editing-aware Injection in Rectified Flow Models for Zero-shot Image-Driven Video Editing

Maomao Li, Yunfei Liu, Yu Li

TL;DR

FREE-Edit is a zero-shot image-driven video editing framework built on recently emerging rectified-Flow models. It is effective across various image-driven video editing scenarios and produces higher-quality outputs than existing techniques.

Abstract

Image-driven video editing aims to propagate the edited content of a modified first frame to the remaining frames. Existing methods usually invert the source video to noise with a pre-trained image-to-video (I2V) model and then guide the sampling process with the edited first frame. A popular way to preserve the motion and layout of the source video is to intervene in the denoising process by injecting attention features obtained during reconstruction. However, such injection often gives unsatisfactory results: excessive injection introduces conflicting semantics from the source video, while insufficient injection preserves too little of the source representation. To address this, we propose an Editing-awaRE (REE) injection method that modulates the injection intensity of each token. Specifically, we first compute the pixel difference between the source and edited first frames to form an editing mask. Next, we track the editing area throughout the video by warping the first-frame mask with optical flow. We then derive an editing-aware feature injection intensity for each token, such that no injection is performed in editing areas. Building on REE injection, we further propose FREE-Edit, a zero-shot image-driven video editing framework based on recently emerging rectified-Flow models. Without fine-tuning or training, FREE-Edit is effective across various image-driven video editing scenarios and produces higher-quality outputs than existing techniques. Project page: https://free-edit.github.io/page/.
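
The mask-to-weight steps in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the threshold value, the flow convention (each target pixel stores an offset back to the first frame), nearest-neighbour warping, and the patch size are all assumptions made for illustration.

```python
import numpy as np

def editing_mask(src_frame, edited_frame, thr=0.1):
    """Binary (H, W) mask: 1 where the first frame was edited, else 0.

    Frames are assumed to be float arrays in [0, 1] with shape (H, W, C);
    thr is a hypothetical threshold on the mean absolute pixel difference.
    """
    diff = np.abs(src_frame.astype(np.float32) - edited_frame.astype(np.float32))
    return (diff.mean(axis=-1) > thr).astype(np.float32)

def warp_mask(mask, flow):
    """Backward-warp the first-frame mask to a later frame.

    flow has shape (H, W, 2) holding (dx, dy); flow[y, x] is assumed to point
    from pixel (x, y) of the later frame to its position in the first frame.
    Sampling is nearest-neighbour, with coordinates clipped to the image.
    """
    h, w = mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]), 0, w - 1).astype(int)
    src_y = np.clip(np.rint(ys + flow[..., 1]), 0, h - 1).astype(int)
    return mask[src_y, src_x]

def token_weights(mask, patch=2):
    """Per-token injection weight: 0 if the patch touches the edit, else 1."""
    h, w = mask.shape
    hp, wp = h // patch, w // patch
    blocks = mask[: hp * patch, : wp * patch].reshape(hp, patch, wp, patch)
    edited = blocks.max(axis=(1, 3))  # a token is "edited" if any pixel is
    return 1.0 - edited

# Toy usage: one edited pixel that moves one pixel to the right.
src = np.zeros((4, 4, 3), dtype=np.float32)
edited = src.copy()
edited[1, 1] = 1.0
mask1 = editing_mask(src, edited)
flow = np.zeros((4, 4, 2), dtype=np.float32)
flow[..., 0] = -1.0  # each later-frame pixel looks one pixel to the left
mask2 = warp_mask(mask1, flow)
lam = token_weights(mask2, patch=2)
```

In this sketch a weight of 1 means full injection from the source and 0 means none, so tokens overlapping the tracked editing area are left untouched by the source features.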

Paper Structure (17 sections, 11 equations, 9 figures, 4 tables)


Figures (9)

  • Figure 1: Pipeline of FREE-Edit. Top: The method follows an "inversion-then-editing" pipeline. Starting from the inverted noisy latent ${\bm z}_1$, the reconstructed video ${\bm z}_0$ and the edited video $\tilde{{\bm z}}_0$ take the source first frame ${{\mathbf X}}^{1}$ and the edited first frame $\hat{{\mathbf X}}^{1}$ as the condition signal, respectively. Bottom: The Editing-awaRE (REE) injection method uses a modulation weight $\bm{\lambda}$ to adaptively replace the intermediate model representations ($\tilde{{\mathbf Q}}$ and $\tilde{{\mathbf K}}$) of the editing process with those (${\mathbf Q}$ and ${\mathbf K}$) of the reconstruction process in the self-attention blocks. We first use optical flow to warp the automatically computed first-frame editing mask, which yields tracked editing masks for subsequent frames. From these masks, we compute the modulation weight $\bm{\lambda}$ for each token, so that no injection is performed in the editing area.
  • Figure 2: Qualitative ablation of different thresholds ($thr$) in Eq. (\ref{map}) for generating the mask ${\mathbf M}^1$, which indicates the editing regions.
  • Figure 2: More comparisons between the standard FREE-Edit (w/ REE injection) and its variants w/o injection and w/ vanilla injection.
  • Figure 3: Qualitative comparison between the standard FREE-Edit (w/ REE injection) and its variants w/o injection and w/ vanilla injection.
  • Figure 3: Left: Results under large viewpoint changes. Right: A failure case with fast motion.
  • ...and 4 more figures
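
The per-token injection described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the linear blend, the convention that $\bm{\lambda}=1$ means "take the reconstruction features", the single-head attention, and all function names are assumptions.

```python
import numpy as np

def ree_inject(q_edit, k_edit, q_rec, k_rec, lam):
    """Blend editing-branch Q/K with reconstruction-branch Q/K per token.

    q_*, k_* have shape (N, D) for N tokens; lam has shape (N,) with values
    in [0, 1]. Assumed convention: lam = 1 copies the reconstruction feature,
    lam = 0 (editing area) keeps the editing-branch feature unchanged.
    """
    lam = lam.reshape(-1, 1)  # broadcast over the channel dimension
    q = lam * q_rec + (1.0 - lam) * q_edit
    k = lam * k_rec + (1.0 - lam) * k_edit
    return q, k

def self_attention(q, k, v):
    """Plain single-head scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy usage: token 0 lies outside the edit, token 1 inside it.
q_rec, k_rec = np.ones((2, 4)), np.ones((2, 4))
q_edit, k_edit = np.zeros((2, 4)), np.zeros((2, 4))
v_edit = np.arange(8, dtype=np.float64).reshape(2, 4)
lam = np.array([1.0, 0.0])
q, k = ree_inject(q_edit, k_edit, q_rec, k_rec, lam)
out = self_attention(q, k, v_edit)
```

With a binary weight the blend reduces to a hard replacement outside the editing area, which matches the caption's statement that no injection is performed inside it.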