Training-Free Video Editing via Optical Flow-Enhanced Score Distillation

Lianghan Zhu; Yanqi Bao; Jing Huo; Jing Wu; Yu-Kun Lai; Wenbin Li; Yang Gao

TL;DR

This paper tackles training-free video editing by directly optimizing the original video latent with editing gradients derived from pre-trained text-to-video diffusion models. It introduces an optical-flow-guided gradient refinement strategy to promote local temporal continuity and two auxiliary losses—content preservation and global semantic consistency—to suppress over-editing and improve global coherence. Empirical results on real videos show the approach achieves competitive or superior preservation of non-edited regions, stronger temporal continuity, and robust alignment with target prompts compared to state-of-the-art zero-shot baselines, while highlighting the dependence on foundation-model capabilities. The work advances practical, training-free video editing by integrating robust gradients with video priors, enabling more reliable edits in real-world content.

Abstract

The rapid advancement in visual generation, particularly the emergence of pre-trained text-to-image and text-to-video models, has catalyzed growing interest in training-free video editing research. Mirroring training-free image editing techniques, current approaches preserve original video information by inverting the input video and then manipulating intermediate features and attention during inference to achieve content editing. Although they have demonstrated promising results, the lossy nature of the inversion process makes it difficult to maintain the unedited regions of the video. Furthermore, feature and attention manipulation during inference can lead to unintended over-editing and struggles with both local temporal continuity and global content consistency. To address these challenges, this study proposes a score distillation paradigm based on pre-trained text-to-video models, in which the original video is iteratively optimized over multiple steps, guided by editing gradients from score distillation, to obtain the target video. Starting the iterative optimization from the original video, combined with a content preservation loss, maintains the unedited regions of the original video and suppresses over-editing. To further improve video content consistency and temporal continuity, we additionally introduce a global consistency auxiliary loss and local editing-gradient smoothing based on optical flow prediction. Experiments demonstrate that these strategies effectively address the aforementioned challenges, achieving comparable or superior performance to state-of-the-art methods across multiple dimensions, including preservation of unedited regions, local temporal continuity, and global content consistency of the editing results.
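
The optimization loop described in the abstract can be pictured with a deliberately tiny sketch. Everything below is an assumption made for illustration: `toy_noise_pred` stands in for a pre-trained text-to-video denoiser, prompts are plain vectors, the temporal smoothing assumes identity optical flow (each frame's gradient is blended with the previous frame's), and the quadratic `keep_weight` term is only a stand-in for a content preservation loss. None of these names or formulas come from the paper.

```python
# Toy illustration only: all functions are hypothetical stand-ins, not the
# paper's model, losses, or optical-flow estimator.

def toy_noise_pred(latent, prompt_vec):
    """Stand-in for a denoiser: the prediction points away from the prompt vector."""
    return [[x - p for x, p in zip(frame, prompt_vec)] for frame in latent]


def dds_gradient(z_edit, z_ref, tgt_prompt, src_prompt):
    """Delta-denoising-style gradient: editing-branch minus reference-branch prediction."""
    e_edit = toy_noise_pred(z_edit, tgt_prompt)
    e_ref = toy_noise_pred(z_ref, src_prompt)
    return [[a - b for a, b in zip(fa, fb)] for fa, fb in zip(e_edit, e_ref)]


def smooth_over_time(grad, weight=0.5):
    """Blend each frame's gradient with the previous frame's (identity flow assumed)."""
    out = [list(grad[0])]
    for t in range(1, len(grad)):
        out.append([(1 - weight) * g + weight * p
                    for g, p in zip(grad[t], grad[t - 1])])
    return out


def edit_video(z0, src_prompt, tgt_prompt, steps=200, lr=0.1, keep_weight=0.5):
    """Iteratively optimize a copy of the original latent z0 (list of frames)."""
    z = [list(frame) for frame in z0]  # start from the original video latent
    for _ in range(steps):
        grad = smooth_over_time(dds_gradient(z, z0, tgt_prompt, src_prompt))
        for t, frame in enumerate(z):
            for i in range(len(frame)):
                # Gradient of 0.5 * keep_weight * (z - z0)^2, a toy preservation term.
                keep = keep_weight * (frame[i] - z0[t][i])
                frame[i] -= lr * (grad[t][i] + keep)
    return z


if __name__ == "__main__":
    z0 = [[0.0, 1.0], [0.1, 1.1], [0.2, 1.2]]  # 3 frames, 2 latent channels
    edited = edit_video(z0, src_prompt=[0.0, 0.0], tgt_prompt=[1.5, 0.0])
    print(edited)
```

In this toy setting, channels where the source and target prompts agree receive zero editing gradient and stay at their original values, while the edited channel settles at a compromise between the target and the original that is controlled by `keep_weight`. Real latents, denoisers, and flow-warped gradients are far richer, so the sketch is meant only to show the shape of the loop.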

Paper Structure (24 sections, 11 equations, 7 figures, 2 tables)

Introduction
Related Works
Text-to-Video Generation
Text-based Video Editing
Preliminaries
Text-conditioned Latent Video Diffusion Models.
Score Distillation for Image Editing.
Method
Overview
Optical Flow-Guided DDS Editing Gradient Refinement
Content Preservation and Global Semantic Consistency Auxiliary Losses
Experiments
Experiments Setting
Data Preparation
Experiment Details
...and 9 more sections

Figures (7)

Figure 1: Some video editing examples. Compared to SOTA, our method achieves superior results in preserving the non-edited content of the original video, ensuring consistency and continuity in the edited results, and alignment with the target prompts.
Figure 2: Visualization of One-Step Prediction of $\mathbf{Z}_0$ by Different Models at Large and Small Time Steps. The predicted $\mathbf{Z}_0$ is computed from the one-step predicted noise, and since visualizing $\mathbf{Z}_0$ is more meaningful, we choose to visualize $\mathbf{Z}_0$.
Figure 3: Overview of Pipeline. Our pipeline comprises reference and editing branches. We employ optical flow-guided gradient refinement to enhance the continuity of gradient predictions for consecutive frame editing. Additionally, by introducing auxiliary losses for global semantic consistency and content preservation, we improve the video's global semantic consistency and maintain the original video content in non-edited regions.
Figure 4: Qualitative Comparison. Our method achieves a good balance between editing and preserving unedited information in the original video. Additionally, it possesses a certain capability for geometric shape editing.
Figure 5: Ablation of the optical flow-guided gradient correction strategy, the content preservation auxiliary loss, and the global semantic consistency auxiliary loss. In most cases, using all of the proposed components achieves the best balance between editing and preservation and yields the highest edited video quality.
...and 2 more figures
