Streaming Video Diffusion: Online Video Editing with Diffusion Models
Feng Chen, Zhen Yang, Bohan Zhuang, Qi Wu
TL;DR
This work defines online video editing as real-time, streaming-frame editing with temporal coherence and proposes Streaming Video Diffusion (SVDiff), a diffusion-based model augmented with a compact spatial-aware temporal memory to handle long-range temporal dynamics. By training on long videos via a segment-level scheme and using memory propagation across clips, SVDiff achieves coherent edits for streaming frames with real-time performance (15.2 FPS at $512\times512$). The approach outperforms baseline online methods and existing diffusion-based editors in both temporal consistency and edit fidelity, while maintaining efficiency through memory-based recurrence and fast sampling (LCM LoRA). Limitations include shot-change detection in very long videos, with future work aimed at improving robustness to scene transitions and complex motion.
Abstract
We present a novel task called online video editing, which is designed to edit streaming frames while maintaining temporal consistency. Unlike existing offline video editing, which assumes that all frames are available in advance, online video editing is tailored to real-life applications such as live streaming and online chat, requiring (1) fast continual step inference, (2) long-term temporal modeling, and (3) zero-shot video editing capability. To meet these requirements, we propose Streaming Video Diffusion (SVDiff), which incorporates a compact spatial-aware temporal recurrence into off-the-shelf Stable Diffusion and is trained with a segment-level scheme on large-scale long videos. This simple yet effective setup allows us to obtain a single model that is capable of handling a broad range of videos and editing each streaming frame with temporal coherence. Our experiments indicate that our model can edit long, high-quality videos while achieving a real-time inference speed of 15.2 FPS at a resolution of $512\times512$.
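The online setting described above can be illustrated with a toy sketch. It is not SVDiff's architecture: the exponential moving average below is only a placeholder for the learned spatial-aware temporal memory, and the names `edit_stream`, `frame_budget_ms`, `decay`, and the list-of-floats frame type are assumptions introduced for illustration. The sketch shows two properties the task requires: frames are consumed one at a time without access to future frames, and the carried state has a fixed size regardless of video length. It also converts the reported 15.2 FPS into a per-frame time budget.

```python
from typing import Iterable, Iterator, List, Optional

Frame = List[float]  # toy stand-in for one frame (flattened pixel values)


def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def edit_stream(frames: Iterable[Frame], decay: float = 0.8) -> Iterator[Frame]:
    """Edit frames one at a time while carrying a fixed-size memory.

    Hypothetical placeholder logic: the memory is an exponential moving
    average of past frames, and the "edit" blends each frame with it.
    A real model would replace both steps with learned components.
    """
    memory: Optional[Frame] = None
    for frame in frames:
        if memory is None:
            # First frame: initialise the memory from the frame itself.
            memory = list(frame)
        else:
            # Update in place of appending: the state never grows with video length.
            memory = [decay * m + (1.0 - decay) * x for m, x in zip(memory, frame)]
        # Emit the edited frame immediately; no future frames are needed.
        yield [0.5 * x + 0.5 * m for x, m in zip(frame, memory)]


if __name__ == "__main__":
    stream = (f for f in [[0.0, 0.0], [1.0, 1.0], [1.0, 1.0]])
    for edited in edit_stream(stream):
        print(edited)
    print(f"{frame_budget_ms(15.2):.1f} ms per frame at 15.2 FPS")
```

At 15.2 FPS the budget is roughly 65.8 ms per frame, which every step of the per-frame pipeline (memory update plus sampling) has to fit within for the edit to keep pace with the stream.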
