EffiVED: Efficient Video Editing via Text-instruction Diffusion Models
Zhenghao Zhang, Zuozhuo Dai, Long Qin, Weizhi Wang
TL;DR
EffiVED tackles two bottlenecks in interactive video editing: scarce training data and the need for per-video fine-tuning. It introduces two data-collection workflows that convert image-editing data and open-world videos into a large synthetic training set, then trains a latent-diffusion video editor with a conditional 3D U-Net and decoupled guidance. The resulting model applies instruction-guided edits directly to input videos, without per-video fine-tuning. On TGVE, EffiVED delivers high-fidelity edits with strong temporal coherence and runs approximately 6–28× faster than prior methods. These results suggest that structured cross-modal data synthesis is a practical way to overcome the scarcity of video-editing data and to scale editing to real-world videos.
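The summary above mentions decoupled guidance without giving a formula. As a minimal sketch, assuming an InstructPix2Pix-style two-scale classifier-free guidance (an assumption, not necessarily the exact rule used by EffiVED), the three noise predictions could be combined as follows. The function name, argument names, and the scale values in the usage note are hypothetical.

```python
from typing import List, Sequence


def decoupled_guidance(
    eps_uncond: Sequence[float],
    eps_video: Sequence[float],
    eps_full: Sequence[float],
    video_scale: float,
    text_scale: float,
) -> List[float]:
    """Combine three noise predictions using separate video and text scales.

    ASSUMPTION: this follows the InstructPix2Pix-style formulation; the
    source text does not state the exact rule EffiVED uses.

    eps_uncond: prediction with neither the input video nor the instruction
    eps_video:  prediction conditioned on the input video only
    eps_full:   prediction conditioned on both input video and instruction
    """
    return [
        u + video_scale * (v - u) + text_scale * (f - v)
        for u, v, f in zip(eps_uncond, eps_video, eps_full)
    ]
```

With both scales set to 1 the result reduces to the fully conditioned prediction. Raising `video_scale` pushes the output toward the input video, while raising `text_scale` pushes it toward the instruction, so the two can be tuned independently.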
Abstract
Large-scale text-to-video models have shown remarkable abilities, but applying them directly to video editing remains challenging because few suitable datasets are available. Current video editing methods commonly require per-video fine-tuning of diffusion models or dedicated inversion optimization to ensure high-fidelity edits. In this paper, we introduce EffiVED, an efficient diffusion-based model that directly supports instruction-guided video editing. To achieve this, we present two efficient workflows for gathering video editing pairs, using augmentation and fundamental vision-language techniques. These workflows transform vast image editing datasets and open-world videos into a high-quality dataset for training EffiVED. Experimental results show that EffiVED not only generates high-quality edited videos but also runs quickly. Finally, we demonstrate that our data collection method significantly improves editing performance and could help address the scarcity of video editing data. Code can be found at https://github.com/alibaba/EffiVED.
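The abstract says image editing datasets are turned into video editing pairs through augmentation, but does not say which augmentation is used. As an illustrative sketch only, assuming a sliding crop that mimics a camera pan (a hypothetical choice), the same crop window can be applied to a source image and its edited counterpart so that the two resulting clips stay aligned frame by frame. All names below are hypothetical, and images are toy nested lists rather than real tensors.

```python
from typing import List, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel values
Clip = List[Image]       # a clip is a list of frames


def crop(image: Image, top: int, left: int, height: int, width: int) -> Image:
    """Return the height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def image_pair_to_clip_pair(
    source: Image,
    edited: Image,
    num_frames: int,
    crop_size: int,
    step: int = 1,
) -> Tuple[Clip, Clip]:
    """Slide one square crop window across both images to fake a camera pan.

    ASSUMPTION: the sliding crop stands in for whatever augmentation the
    authors actually use; it is not taken from the source text.
    """
    height, width = len(source), len(source[0])
    if len(edited) != height or len(edited[0]) != width:
        raise ValueError("source and edited images must have the same size")
    max_left = width - crop_size
    if crop_size > height or max_left < 0:
        raise ValueError("crop_size must fit inside the image")

    source_clip: Clip = []
    edited_clip: Clip = []
    for t in range(num_frames):
        # Clamp the offset so the window never leaves the image.
        left = min(t * step, max_left)
        source_clip.append(crop(source, 0, left, crop_size, crop_size))
        edited_clip.append(crop(edited, 0, left, crop_size, crop_size))
    return source_clip, edited_clip
```

Because both clips share the same window trajectory, every source frame has a pixel-aligned edited frame, which is the property a paired training set needs.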
