EffiVED: Efficient Video Editing via Text-instruction Diffusion Models

Zhenghao Zhang, Zuozhuo Dai, Long Qin, Weizhi Wang

TL;DR

EffiVED tackles two bottlenecks in instruction-guided video editing, data scarcity and per-video fine-tuning, by introducing two data-collection workflows that convert image-editing data and open-world videos into a large synthetic training set. It trains a latent-diffusion video editor with a conditional 3D U-Net and decoupled guidance to perform instruction-guided edits directly on input videos, without per-video fine-tuning. On TGVE, EffiVED delivers high-fidelity edits with strong temporal coherence and runs approximately 6–28× faster than prior methods, which supports the practicality of training on synthetic data for video editing. The approach shows how structured cross-modal data synthesis can address the scarcity of video-editing data.

Abstract

Large-scale text-to-video models have shown remarkable abilities, but their direct application to video editing remains challenging due to the limited datasets available. Current video editing methods commonly require per-video fine-tuning of diffusion models or specific inversion optimization to ensure high-fidelity edits. In this paper, we introduce EffiVED, an efficient diffusion-based model that directly supports instruction-guided video editing. To achieve this, we present two efficient workflows to gather video editing pairs, utilizing augmentation and fundamental vision-language techniques. These workflows transform vast image editing datasets and open-world videos into a high-quality dataset for training EffiVED. Experimental results reveal that EffiVED not only generates high-quality edited videos but also executes rapidly. Finally, we demonstrate that our data collection method significantly improves editing performance and can potentially tackle the scarcity of video editing data. Code can be found at https://github.com/alibaba/EffiVED.

Paper Structure

This paper contains 13 sections, 6 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Left: EffiVED offers users a versatile range of video editing capabilities, including modifications to objects and backgrounds as well as style transfer. Right: Temporal consistency and text alignment vs. runtime (s) on the TGVE dataset [DBLP:journals/corr/abs-2401-07781]. For runtime, all methods are evaluated by editing a 60-frame, 512p$\times$512p video using A100 GPUs with the official implementation. EffiVED achieves an inference time of 47 seconds, a 6 to 28 times speedup over existing methods without compromising editing quality.
  • Figure 2: An example of generating training data from an image editing dataset. Given a pair of original and edited images, we randomly select a set of affine transformations (e.g., rotation, crop, translation, or shearing) and apply them to both images. This generates, for each image, a sequence of frames that simulates camera movement.
  • Figure 3: An overview of generating training data from open-world videos. (i) First, we leverage CoCa and VideoBLIP to extract captions from the keyframes and from the entire video, which ChatGPT then synthesizes into a comprehensive summary. (ii) Next, we use ChatGPT to generate an editing instruction and an edited caption, guided by manually written examples. (iii) Finally, the generated instruction and the original video are fed to an individual CoDeF model to produce the edited video.
  • Figure 4: Overview of our training pipeline. We adopt the widely used 3D U-Net based video diffusion model [wang2023videocomposer] for video editing. To enable vision conditioning, we augment the 3D U-Net's input by appending extra channels to its initial convolutional layers. The input is essentially a channel-wise concatenation of video latents and noise.
  • Figure 5: A/B Comparison with current methods. Our method not only effectively aligns edited videos with instructions but also consistently preserves the video's structure.
  • ...and 3 more figures
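
The two data-side mechanics named in the captions of Figures 2 and 4 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`affine_trajectory`, `warp`, `make_video_pair`, `concat_channels`), the choice of rotation plus translation only, the linear ramp from the identity transform, and the nested-list image and latent layouts are all hypothetical. The one property taken from the captions is that the same transformations are applied to both the original and the edited image, and that the network input is a concatenation along the channel axis.

```python
import math
import random


def affine_trajectory(num_frames, max_shift=3.0, max_angle_deg=5.0, seed=0):
    """Sample one random rotation + translation and ramp it linearly from
    the identity, giving per-frame (angle, dx, dy). The ramp is an assumption
    used here to mimic a smooth camera move."""
    rng = random.Random(seed)
    angle = math.radians(rng.uniform(-max_angle_deg, max_angle_deg))
    dx = rng.uniform(-max_shift, max_shift)
    dy = rng.uniform(-max_shift, max_shift)
    steps = max(num_frames - 1, 1)
    return [(angle * t / steps, dx * t / steps, dy * t / steps)
            for t in range(num_frames)]


def warp(image, angle, dx, dy):
    """Rotate about the image centre, then translate, using nearest-neighbour
    inverse mapping. Pixels that map outside the source are set to 0."""
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse map: undo the translation, then undo the rotation.
            xs, ys = x - dx - cx, y - dy - cy
            ix = round(cos_a * xs + sin_a * ys + cx)
            iy = round(-sin_a * xs + cos_a * ys + cy)
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = image[iy][ix]
    return out


def make_video_pair(original, edited, num_frames=8, seed=0):
    """Turn an (original, edited) image pair into two frame sequences that
    share the exact same per-frame transformation (as in Figure 2)."""
    params = affine_trajectory(num_frames, seed=seed)
    source_frames = [warp(original, *p) for p in params]
    target_frames = [warp(edited, *p) for p in params]
    return source_frames, target_frames


def concat_channels(video_latents, noise):
    """Channel-wise concatenation for latents stored as [C][T][H][W] nested
    lists (as in Figure 4). Only the channel count may differ."""
    def dims(x):
        return (len(x[0]), len(x[0][0]), len(x[0][0][0]))
    if dims(video_latents) != dims(noise):
        raise ValueError("T, H, W must match for channel-wise concatenation")
    return video_latents + noise
```

Sharing one parameter list between the two sequences is what keeps each source frame aligned with its edited counterpart; sampling transformations independently for the two images would break that pairing. In a tensor library, the concatenation step would normally be a single call along the channel dimension, which is why the first convolution needs extra input channels.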