FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing

Youyuan Zhang, Xuan Ju, James J. Clark

TL;DR

This work proposes FastVideoEdit, an efficient zero-shot video editing approach inspired by Consistency Models, which eliminates the need for time-consuming inversion or additional condition extraction and thereby reduces editing time.

Abstract

Diffusion models have demonstrated remarkable capabilities in text-to-image and text-to-video generation, opening up possibilities for video editing based on textual input. However, the computational cost of sequential sampling in diffusion models poses challenges for efficient video editing. Existing approaches that rely on image generation models for video editing suffer from time-consuming one-shot fine-tuning, additional condition extraction, or DDIM inversion, making real-time applications impractical. In this work, we propose FastVideoEdit, an efficient zero-shot video editing approach inspired by Consistency Models (CMs). By leveraging the self-consistency property of CMs, we eliminate the need for time-consuming inversion or additional condition extraction, reducing editing time. Using a special variance schedule, our method maps the source video directly to the target video while strongly preserving source content. This yields a speed advantage, as fewer sampling steps can be used while maintaining comparable generation quality. Experimental results validate the state-of-the-art performance and speed advantages of FastVideoEdit across evaluation metrics encompassing editing speed, temporal consistency, and text-video alignment.
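
For context, the self-consistency property that the abstract relies on can be stated as below. This is the standard formulation from the consistency-models literature, not an equation reproduced from this paper; the symbols $f$ (consistency function), $x_t$ (noisy sample at time $t$), $\epsilon$ (smallest time), and $T$ (largest time) are notation assumed here for illustration.

$$
f(x_t, t) = f(x_{t'}, t') \quad \text{for all } t, t' \in [\epsilon, T] \text{ on the same probability-flow ODE trajectory}, \qquad f(x_\epsilon, \epsilon) = x_\epsilon .
$$

Because every point on a trajectory maps to the same origin, a clean sample can in principle be estimated from a noisy one in a single evaluation, which is consistent with the abstract's claim that fewer sampling steps are needed.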

Paper Structure (25 sections, 17 equations, 4 figures, 2 tables, 1 algorithm)

Figures (4)

  • Figure 1: Editing results of FastVideoEdit. FastVideoEdit offers efficient, consistent, high-quality, and text-aligned editing capabilities for both artificial (left column) and natural (right column) videos. The top row displays the source video, while the second and third rows showcase two edited videos. Each row features a text prompt at the top, with the edited words highlighted in red. These examples show that our method achieves desired edits such as attribute change, object change, background change, and style change.
  • Figure 2: Overview of FastVideoEdit. Our model directly denoises three branches of batch frames using three attention control methods: $\text{CF-Masa}$, $\text{Re-CA}$, and $\text{Bg-Masa}$. The model uses batch consistency sampling (BCS) with LCMs to improve efficiency, background latent blending to align the edited content with the source video, and TokenFlow propagation to further improve temporal consistency. The shaded part on the right details how batch consistency sampling is used to estimate noise in the editing branch and the background branch.
  • Figure 3: Qualitative comparison of FastVideoEdit with previous video editing methods. The top row displays the source video, while the following rows showcase videos edited by previous editing methods and by FastVideoEdit. Source and target text prompts are shown at the top, with the edited words highlighted in red.
  • Figure 4: Illustration of ablation on model architecture.