LOVECon: Text-driven Training-Free Long Video Editing with ControlNet

Zhenyi Liao, Zhijie Deng

TL;DR

LOVECon presents a training-free, text-driven approach for long video editing by operating on windowed segments and enforcing global coherence through cross-window attention, while preserving source content via DDIM inversion-guided latent fusion and reducing frame flicker with a frame interpolation model. Built on pre-trained Stable Diffusion and ControlNet, it introduces a practical pipeline with a four-frame cross-window context, a mask-based latent fusion mechanism, and dual-stage interpolation to handle hundreds of frames efficiently. Empirical results on object attribute edits, style transfer, and background replacement show improved fidelity and temporal consistency over baselines, with strong performance in long sequences (up to 128 frames). The work enables accessible, training-free long video editing while highlighting limitations related to shape changes and content-motion complexity, pointing to avenues for enhanced temporal robustness and broader editing capabilities.

Abstract

Leveraging pre-trained conditional diffusion models for video editing without further tuning has gained increasing attention due to its promise in film production, advertising, etc. Yet, seminal works in this line fall short in generation length, temporal coherence, or fidelity to the source video. This paper aims to bridge the gap, establishing a simple and effective baseline for training-free diffusion model-based long video editing. As suggested by prior art, we build the pipeline upon ControlNet, which excels at various image editing tasks based on text prompts. To overcome the length constraints caused by limited computational memory, we split the long video into consecutive windows and develop a novel cross-window attention mechanism to ensure the consistency of global style and maximize the smoothness among windows. To achieve more accurate control, we extract the information from the source video via DDIM inversion and integrate the outcomes into the latent states of the generations. We also incorporate a video frame interpolation model to mitigate the frame-level flickering issue. Extensive empirical studies verify the superior efficacy of our method over competing baselines across scenarios, including the replacement of the attributes of foreground objects, style transfer, and background replacement. In addition, our method can edit videos comprising hundreds of frames according to user requirements. Our project is open-sourced and the project page is at https://github.com/zhijie-group/LOVECon.
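
As a rough illustration of the windowing idea in the abstract, the sketch below splits frame indices into consecutive windows and picks a few earlier frames that a later window could attend to. The function names, the window size, and the choice of context frames (first frame of the video plus last frame of the previous window) are assumptions made for illustration only; they are not taken from the LOVECon code base, and the summary here does not say which frames form the cross-window context.

```python
from typing import List, Tuple


def split_into_windows(num_frames: int, window_size: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame ranges covering the video in order."""
    if num_frames < 0 or window_size <= 0:
        raise ValueError("num_frames must be >= 0 and window_size must be > 0")
    return [
        (start, min(start + window_size, num_frames))
        for start in range(0, num_frames, window_size)
    ]


def context_frames(window_index: int, windows: List[Tuple[int, int]]) -> List[int]:
    """Hypothetical context selection for cross-window attention.

    Assumption: a window attends to the first frame of the whole video and
    the last frame of the previous window. The first window has no context.
    """
    if window_index == 0:
        return []
    _, prev_end = windows[window_index - 1]
    return [windows[0][0], prev_end - 1]


if __name__ == "__main__":
    windows = split_into_windows(num_frames=48, window_size=16)
    for i, (start, end) in enumerate(windows):
        print(f"window {i}: frames [{start}, {end}) context={context_frames(i, windows)}")
```

Editing one window at a time keeps memory bounded by the window size, while the shared context frames give each window a common reference for style.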

Paper Structure (15 sections, 10 equations, 6 figures, 1 table, 1 algorithm)


Figures (6)

  • Figure 1: Method overview. LOVECon is built upon Stable Diffusion and ControlNet (omitted in the plot for simplicity) for long video editing. LOVECon splits the source video into consecutive windows and edits them sequentially, employing cross-window attention to improve inter-window consistency. LOVECon initializes from the inverted latents produced by DDIM inversion. During the denoising process, LOVECon fuses the latent states of the edited frames with those of the source frames from DDIM inversion (in the green box) to maintain the structure of the source video. LOVECon further incorporates a video interpolation model to address the frame-level flickering issue.
  • Figure 2: Comparison between our method and baselines. Due to space constraints, we show only 8 evenly spaced frames from the 48-frame edited video. Refer to the supplementary material for the complete edited video. The results show that LOVECon provides precise control over the editing process, recovers intricate details from the source frames, and achieves high fidelity.
  • Figure 3: More editing results of our method using various prompts. These examples include attribute editing, background editing, and style transfer, demonstrating that our method maintains high fidelity and consistency when editing long videos while achieving the desired effects.
  • Figure 4: Ablation study on cross-window attention. The first row shows the source frames, and the second shows the result of editing each window independently. The remaining rows are edited, respectively, with ControlVideo-I, which uses full cross-frame attention and a hierarchical sampler, and with cross-window attention only. All other modules for long video editing are removed in this comparison. The results show that our pipeline with cross-window attention achieves results comparable to those of the costly full cross-frame attention.
  • Figure 5: Ablation study on frame interpolation. The first row shows the source frames; the second and third rows are edited without and with the frame interpolation mechanism, respectively. On closer inspection of consecutive frames, images in the second row exhibit subtle differences in color and detail (marked by the black boxes), which can degrade video quality. In contrast, those in the third row show more consistent texture and color. Zoom in for more details.
  • ...and 1 more figures
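
The latent fusion step mentioned in the Figure 1 caption can be sketched as a per-element blend between edited latents and the source latents obtained from DDIM inversion. This is a minimal sketch under stated assumptions: latents are flattened to plain lists of floats, the mask is taken as given (1.0 where the edit should apply, 0.0 where the source should be preserved), and `fuse_latents` is a hypothetical name. How LOVECon actually derives its mask and at which denoising steps it fuses is not described in this summary.

```python
from typing import List, Sequence


def fuse_latents(
    edited: Sequence[float],
    source: Sequence[float],
    mask: Sequence[float],
) -> List[float]:
    """Blend edited and source latents element-wise.

    mask[i] = 1.0 keeps the edited latent, mask[i] = 0.0 keeps the
    source latent, and values in between interpolate linearly.
    """
    if not (len(edited) == len(source) == len(mask)):
        raise ValueError("edited, source, and mask must have the same length")
    return [m * e + (1.0 - m) * s for e, s, m in zip(edited, source, mask)]


if __name__ == "__main__":
    edited = [1.0, 2.0, 3.0]      # latents from the current denoising step
    source = [10.0, 20.0, 30.0]   # latents from DDIM inversion of the source
    mask = [1.0, 0.0, 0.5]        # hypothetical editing mask
    print(fuse_latents(edited, source, mask))
```

Regions outside the mask are copied from the source latents, which is one simple way to keep the structure of the source video while the masked region follows the text prompt.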