Training-Free Video Editing via Optical Flow-Enhanced Score Distillation
Lianghan Zhu, Yanqi Bao, Jing Huo, Jing Wu, Yu-Kun Lai, Wenbin Li, Yang Gao
TL;DR
This paper tackles training-free video editing by directly optimizing the original video latent with editing gradients derived from pre-trained text-to-video diffusion models. It introduces an optical-flow-guided gradient refinement strategy to promote local temporal continuity, along with two auxiliary losses (content preservation and global semantic consistency) that suppress over-editing and improve global coherence. Experiments on real videos show that, compared to state-of-the-art zero-shot baselines, the approach achieves competitive or superior preservation of non-edited regions, stronger temporal continuity, and robust alignment with target prompts. The results also highlight the method's dependence on the capabilities of the underlying foundation model. By combining score-distillation gradients with video priors, the work makes training-free editing of real-world videos more reliable.
Abstract
The rapid advancement in visual generation, particularly the emergence of pre-trained text-to-image and text-to-video models, has catalyzed growing interest in training-free video editing. Mirroring training-free image editing techniques, current approaches preserve the original video's information by inverting the input video, then manipulate intermediate features and attention during inference to edit its content. Although these approaches have shown promising results, the lossy nature of the inversion process makes it difficult to preserve the unedited regions of the video. Furthermore, feature and attention manipulation during inference can cause unintended over-editing and struggles to maintain both local temporal continuity and global content consistency. To address these challenges, this study proposes a score distillation paradigm based on pre-trained text-to-video models, in which the original video is iteratively optimized over multiple steps, guided by editing gradients from score distillation, to obtain the target video. Starting the iterative optimization from the original video and combining it with a content preservation loss maintains the unedited regions of the original video and suppresses over-editing. To further improve content consistency and temporal continuity, we additionally introduce a global consistency auxiliary loss and a local smoothing of the editing gradients based on optical flow prediction. Experiments demonstrate that these strategies effectively address the aforementioned challenges, achieving performance comparable or superior to state-of-the-art methods across multiple dimensions, including preservation of unedited regions, local temporal continuity, and global content consistency of the editing results.
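
As a rough illustration of the optimization loop described above, the following NumPy sketch runs gradient descent on a latent that starts from the original video latent, combining an editing gradient with a content preservation term. It is a toy under stated assumptions, not the authors' implementation: `toy_edit_gradient` is a hypothetical stand-in for the gradient a pre-trained text-to-video model would supply, the explicit `mask` and constant `target` are illustrative devices, the temporal smoothing assumes zero motion instead of predicted optical flow, and the global consistency loss is omitted. All names and hyperparameters (`edit_latent`, `lam`, `lr`, `steps`) are assumptions made for this example.

```python
import numpy as np

def toy_edit_gradient(z, target, mask):
    """Hypothetical stand-in for a score-distillation editing gradient.

    A real system would query a pre-trained text-to-video diffusion model;
    here the gradient simply pulls masked latents toward `target`.
    """
    return mask * (z - target)

def smooth_over_time(grad, weight=0.5):
    """Blend each frame's gradient with the mean of its temporal neighbours.

    Assumes zero motion (identity flow); a flow-guided variant would warp
    the neighbouring gradients before blending.
    """
    # Replicate the first and last frames so every frame has two neighbours.
    padded = np.concatenate([grad[:1], grad, grad[-1:]], axis=0)
    neighbours = 0.5 * (padded[:-2] + padded[2:])
    return (1.0 - weight) * grad + weight * neighbours

def edit_latent(z0, target, mask, steps=200, lr=0.1, lam=0.05):
    """Gradient descent on the latent, starting from the original latent z0."""
    z = z0.copy()
    for _ in range(steps):
        g_edit = smooth_over_time(toy_edit_gradient(z, target, mask))
        g_keep = 2.0 * lam * (z - z0)  # gradient of lam * ||z - z0||^2
        z -= lr * (g_edit + g_keep)
    return z

rng = np.random.default_rng(0)
z0 = rng.normal(size=(8, 4, 16, 16))  # (frames, channels, height, width)
target = np.ones_like(z0)             # toy "edited" latent
mask = np.zeros_like(z0)
mask[:, :, 4:12, 4:12] = 1.0          # region the toy edit is allowed to touch
z_edit = edit_latent(z0, target, mask)
```

In this toy setting, latents outside the mask receive no editing gradient and therefore stay at their original values, while latents inside the mask move toward the target, with `lam` controlling how strongly they are held near `z0`.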
