Streaming Video Diffusion: Online Video Editing with Diffusion Models

Feng Chen, Zhen Yang, Bohan Zhuang, Qi Wu

TL;DR

This work defines online video editing as real-time, streaming-frame editing with temporal coherence and proposes Streaming Video Diffusion (SVDiff), a diffusion-based model augmented with a compact spatial-aware temporal memory to handle long-range temporal dynamics. By training on long videos via a segment-level scheme and using memory propagation across clips, SVDiff achieves coherent edits for streaming frames with real-time performance (15.2 FPS at $512\times512$). The approach outperforms baseline online methods and existing diffusion-based editors in both temporal consistency and edit fidelity, while maintaining efficiency through memory-based recurrence and fast sampling (LCM LoRA). Limitations include shot-change detection in very long videos, with future work aimed at improving robustness to scene transitions and complex motion.

Abstract

We present a novel task called online video editing, which is designed to edit streaming frames while maintaining temporal consistency. Unlike existing offline video editing, which assumes all frames are pre-established and accessible, online video editing is tailored to real-life applications such as live streaming and online chat, requiring (1) fast continual step inference, (2) long-term temporal modeling, and (3) zero-shot video editing capability. To address these requirements, we propose Streaming Video Diffusion (SVDiff), which incorporates a compact spatial-aware temporal recurrence into off-the-shelf Stable Diffusion and is trained with a segment-level scheme on large-scale long videos. This simple yet effective setup yields a single model that can handle a broad range of videos and edit each streaming frame with temporal coherence. Our experiments indicate that our model can edit long, high-quality videos with remarkable results, achieving a real-time inference speed of 15.2 FPS at a resolution of 512x512.
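
To make the online setting concrete, the following is a minimal sketch of causal, frame-by-frame processing with a fixed-size recurrent state. It is not the SVDiff model: `edit_frame`, `edit_stream`, the `alpha` blending weight, and the use of a flat list of floats as a "frame" are all hypothetical stand-ins chosen for illustration. The only properties it is meant to show are that each output depends solely on the current and earlier frames, and that the memory does not grow with video length.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Frame = List[float]  # toy stand-in for a frame (or its latent); an assumption


def edit_frame(
    frame: Frame, memory: Optional[Frame], alpha: float = 0.5
) -> Tuple[Frame, Frame]:
    """Hypothetical single-frame step: returns (edited_frame, updated_memory).

    The update is a plain exponential moving average, used only as a
    placeholder for a learned temporal memory.
    """
    if memory is None:
        # First frame of the stream: the memory starts from the frame itself.
        new_memory = list(frame)
    else:
        # Blend the old memory with the new frame; the size stays constant.
        new_memory = [alpha * m + (1.0 - alpha) * x for m, x in zip(memory, frame)]
    edited = list(new_memory)  # placeholder "edit": emit the smoothed frame
    return edited, new_memory


def edit_stream(frames: Iterable[Frame]) -> Iterator[Frame]:
    """Process frames one at a time, carrying only the memory forward."""
    memory: Optional[Frame] = None
    for frame in frames:
        edited, memory = edit_frame(frame, memory)
        yield edited  # available immediately, before later frames arrive
```

Because `edit_stream` is a generator, it can consume a live source and emit each edited frame as soon as it is computed, which is the behaviour an offline editor that needs the whole video cannot provide.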

Paper Structure (14 sections, 8 equations, 8 figures, 3 tables)

This paper contains 14 sections, 8 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Comparison between offline and online video editing. Offline video editing processes the whole video simultaneously and regards all frames as known. Online video editing operates on each streaming frame, using the temporal information from previous frames in a causal way.
  • Figure 2: Overview of Streaming Video Diffusion. (a) We propose a spatial-aware temporal memory, which is inserted with memory attention after each transformer block in Stable Diffusion. Our method is then trained on large-scale long videos by splitting each long video into short clips. (b) During inference, we denoise the noisy latent of each streaming frame with classifier-free guidance (CFG), where each denoising step involves the U-Net conducting conditional and unconditional denoising with the corresponding memory.
  • Figure 3: Qualitative editing results on long videos, where the red number in the lower right corner denotes the frame index.
  • Figure 4: Visual comparison between baseline models and our method, where the edit prompt is "a rabbit is eating pizza".
  • Figure 5: Performance comparison with baseline models in long video editing with different video lengths.
  • ...and 3 more figures
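
The Figure 2 caption mentions classifier-free guidance with a separate memory for the conditional and unconditional passes. A minimal sketch of that combination step is given below, using the standard CFG formula `eps = eps_uncond + s * (eps_cond - eps_uncond)`. The function names `cfg_noise` and `toy_denoise`, the list-of-floats latents, and the `denoise(latent, prompt, memory)` signature are assumptions made for illustration; how SVDiff actually reads and updates its memory is not specified in this summary and is not modelled here.

```python
from typing import Callable, List, Optional

Vec = List[float]  # toy stand-in for a latent or a memory tensor; an assumption
Denoiser = Callable[[Vec, Optional[str], Vec], Vec]


def cfg_noise(
    denoise: Denoiser,
    latent: Vec,
    prompt: str,
    mem_cond: Vec,
    mem_uncond: Vec,
    guidance_scale: float,
) -> Vec:
    """Combine conditional and unconditional noise predictions with CFG.

    Each branch is given its own memory, mirroring the "corresponding
    memory" wording of the caption. The denoiser itself is a placeholder.
    """
    eps_cond = denoise(latent, prompt, mem_cond)     # prompt-conditioned pass
    eps_uncond = denoise(latent, None, mem_uncond)   # unconditional pass
    # Standard classifier-free guidance extrapolation.
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def toy_denoise(latent: Vec, prompt: Optional[str], memory: Vec) -> Vec:
    """Hypothetical denoiser: adds the memory, plus 1.0 when a prompt is given."""
    bias = 1.0 if prompt is not None else 0.0
    return [x + m + bias for x, m in zip(latent, memory)]
```

With `guidance_scale = 1.0` the result reduces to the conditional prediction, and with `guidance_scale = 0.0` to the unconditional one; larger values push the prediction further toward the prompt.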