VRWKV-Editor: Reducing quadratic complexity in transformer-based video editing
Abdelilah Aitrouga, Youssef Hmamouche, Amal El Fallah Seghrouchni
TL;DR
This work addresses the computational cost of quadratic attention in diffusion-based video editing by introducing VRWKV-Editor, which integrates a linear spatio-temporal aggregation module (VRWKV) into a diffusion framework. The method encodes videos into a latent space and uses a 3D-VRWKV module within a U-Net to predict the noise for the edited output, reducing time and memory while preserving temporal coherence and text alignment. Experiments show up to 3.7x speedup and 60% lower memory usage than state-of-the-art diffusion-based editors, with comparable frame consistency and prompt fidelity; the gains grow with video length. The approach moves diffusion-based video editing closer to practical, real-time use by applying linear-attention ideas to a video-oriented architecture.
Abstract
In light of recent progress in video editing, deep learning models that capture both spatial and temporal dependencies have emerged as the dominant approach. However, these models suffer from the quadratic computational complexity of traditional attention mechanisms, which makes them difficult to adapt to long-duration and high-resolution videos and restricts their applicability in practical contexts such as real-time video processing. To tackle this challenge, we propose VRWKV-Editor, a novel video editing model that reduces both time and space complexity by integrating a linear spatio-temporal aggregation module into video-based diffusion models. VRWKV-Editor leverages the bidirectional weighted key-value recurrence mechanism of the RWKV transformer to capture global dependencies while preserving temporal coherence, achieving linear complexity without sacrificing quality. Extensive experiments demonstrate that the proposed method achieves up to 3.7x speedup and 60% lower memory usage compared to state-of-the-art diffusion-based video editing methods, while maintaining competitive performance in frame consistency and text alignment. Furthermore, a comparative analysis on videos of different sequence lengths confirms that the gap in editing speed between our approach and self-attention architectures widens as videos get longer.
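To make the linear-complexity claim concrete, the following is a minimal sketch of how a bidirectional, decay-weighted key-value aggregation can be computed in O(T) rather than O(T^2). It is a simplified illustration and not the VRWKV-Editor implementation: the scalar (single-channel) tokens, the decay parameter `w`, the weighting `exp(-|t - i| * w + k[i])`, and the function names `bi_wkv_quadratic` and `bi_wkv_linear` are all assumptions made for this example. Numerical stabilization of the exponentials is also omitted.

```python
import math


def bi_wkv_quadratic(k, v, w):
    """Reference version: each token aggregates over all tokens, O(T^2)."""
    T = len(k)
    out = []
    for t in range(T):
        num = den = 0.0
        for i in range(T):
            # Weight decays with distance |t - i| and grows with the key k[i].
            weight = math.exp(-abs(t - i) * w + k[i])
            num += weight * v[i]
            den += weight
        out.append(num / den)
    return out


def bi_wkv_linear(k, v, w):
    """Same result using one forward and one backward scan, O(T)."""
    T = len(k)
    decay = math.exp(-w)

    # Forward scan: running sums over positions i <= t.
    fwd_num, fwd_den = [0.0] * T, [0.0] * T
    num = den = 0.0
    for t in range(T):
        e = math.exp(k[t])
        num = decay * num + e * v[t]
        den = decay * den + e
        fwd_num[t], fwd_den[t] = num, den

    # Backward scan: running sums over positions i > t (strictly future).
    out = [0.0] * T
    num = den = 0.0
    for t in range(T - 1, -1, -1):
        out[t] = (fwd_num[t] + num) / (fwd_den[t] + den)
        e = math.exp(k[t])
        # Fold token t into the state and decay it by one step for t - 1.
        num = decay * (num + e * v[t])
        den = decay * (den + e)
    return out


if __name__ == "__main__":
    keys = [0.1, -0.3, 0.5, 0.0, 0.2]
    values = [1.0, 2.0, -1.0, 0.5, 3.0]
    slow = bi_wkv_quadratic(keys, values, w=0.7)
    fast = bi_wkv_linear(keys, values, w=0.7)
    print(max(abs(a - b) for a, b in zip(slow, fast)))
```

Both functions return the same aggregated values up to floating-point error; only the second keeps the work proportional to the number of tokens, which is why the speed gap relative to self-attention grows with sequence length.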
