VRWKV-Editor: Reducing quadratic complexity in transformer-based video editing

Abdelilah Aitrouga, Youssef Hmamouche, Amal El Fallah Seghrouchni

TL;DR

This work addresses the computational cost of quadratic attention in diffusion-based video editing by introducing VRWKV-Editor, which integrates a linear spatio-temporal aggregation module (VRWKV) into a diffusion framework. The method encodes videos into a latent space and uses a 3D-VRWKV module within a U-Net to predict the noise for edited outputs, reducing time and memory cost while preserving temporal coherence and text alignment. Empirical results show up to 3.7x speedup and 60% lower memory usage than state-of-the-art diffusion-based editors, with comparable frame consistency and prompt fidelity; the speed advantage grows on longer videos. The approach moves diffusion-based video editing closer to practical, real-time use by applying linear-attention ideas to a video-oriented architecture.

Abstract

In light of recent progress in video editing, deep learning models focusing on both spatial and temporal dependencies have emerged as the primary method. However, these models suffer from the quadratic computational complexity of traditional attention mechanisms, making them difficult to adapt to long-duration and high-resolution videos. This limitation restricts their applicability in practical contexts such as real-time video processing. To tackle this challenge, we introduce a method to reduce both time and space complexity of these systems by proposing VRWKV-Editor, a novel video editing model that integrates a linear spatio-temporal aggregation module into video-based diffusion models. VRWKV-Editor leverages the bidirectional weighted key-value recurrence mechanism of the RWKV transformer to capture global dependencies while preserving temporal coherence, achieving linear complexity without sacrificing quality. Extensive experiments demonstrate that the proposed method achieves up to 3.7x speedup and 60% lower memory usage compared to state-of-the-art diffusion-based video editing methods, while maintaining competitive performance in frame consistency and text alignment. Furthermore, a comparative analysis we conducted on videos with different sequence lengths confirms that the gap in editing speed between our approach and architectures with self-attention becomes more significant with long videos.
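
To make the "linear complexity" claim concrete, the following is a minimal sketch of a bidirectional weighted key-value aggregation computed two ways: a direct all-pairs form that costs O(N^2) in the number of tokens N, and a forward/backward recurrence that produces the same output in O(N). The weighting is modeled on the bidirectional WKV form described for Vision-RWKV and is an assumption here; the 3D module used in VRWKV-Editor may differ. The function names, the per-channel parameters `w` (decay) and `u` (current-token bonus), and the flat token layout are all illustrative, and the sketch omits the numerical stabilization a real implementation would need.

```python
import numpy as np

def bi_wkv_naive(k, v, w, u):
    """All-pairs reference, O(N^2): each token aggregates every other token.

    k, v: (N, C) keys and values; w, u: (C,) decay and current-token bonus.
    """
    n, _ = k.shape
    out = np.empty_like(v)
    for t in range(n):
        dist = np.abs(np.arange(n) - t)                # |t - i| for every token i
        logit = -(dist[:, None] - 1) / n * w + k       # distance-decayed key scores
        logit[t] = u + k[t]                            # current token gets bonus u instead
        weight = np.exp(logit)
        out[t] = (weight * v).sum(axis=0) / weight.sum(axis=0)
    return out

def bi_wkv_linear(k, v, w, u):
    """Same output via one forward and one backward recurrence, O(N)."""
    n, c = k.shape
    decay = np.exp(-w / n)                             # per-step decay factor
    ek = np.exp(k)
    num_f = np.zeros((n, c)); den_f = np.zeros((n, c)) # contributions from i < t
    num_b = np.zeros((n, c)); den_b = np.zeros((n, c)) # contributions from i > t
    for t in range(1, n):
        num_f[t] = decay * num_f[t - 1] + ek[t - 1] * v[t - 1]
        den_f[t] = decay * den_f[t - 1] + ek[t - 1]
    for t in range(n - 2, -1, -1):
        num_b[t] = decay * num_b[t + 1] + ek[t + 1] * v[t + 1]
        den_b[t] = decay * den_b[t + 1] + ek[t + 1]
    cur = np.exp(u + k)                                # current-token term
    return (num_f + num_b + cur * v) / (den_f + den_b + cur)

# Small demo on random data (hypothetical sizes): the two forms should agree.
rng = np.random.default_rng(0)
k = rng.normal(size=(16, 4))
v = rng.normal(size=(16, 4))
w = np.abs(rng.normal(size=4))
u = rng.normal(size=4)
print(np.abs(bi_wkv_naive(k, v, w, u) - bi_wkv_linear(k, v, w, u)).max())
```

Because the recurrent form touches each token a constant number of times, doubling the number of frames roughly doubles its cost, whereas the all-pairs form roughly quadruples. This is consistent with the abstract's observation that the speed gap widens on longer videos.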

Paper Structure

This paper contains 19 sections, 21 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Representative results produced by VRWKV-Editor. The proposed method is capable of object replacement, background modification, and style transformation, while consistently preserving the motion dynamics and visual characteristics of the original video. Additional qualitative examples are provided on the project page: https://abdo-rg.github.io/VRWKV-Editor/.
  • Figure 2: Pipeline of VRWKV-Editor: Given a text–video pair (e.g., “A man with a backpack hikes on a rocky terrain”) as input, our method leverages pretrained text-to-image diffusion models for text-to-video generation. The input video is first encoded into a discrete latent space, after which our U-Net architecture predicts the injected noise (a detailed illustration is depicted in Figure 3). During inference, a novel video is synthesized by inverting the discrete noise from the input video, guided by an edited prompt (e.g., “An astronaut with a jetpack floats above a Martian landscape, with red rocky terrains and tall”).
  • Figure 3: Architecture of the U-Net backbone employed in VRWKV-Editor. The design incorporates VRWKV modules within the skip connections, enabling efficient long-range dependency modeling while preserving fine-grained spatial information across down-sampling and up-sampling layers.
  • Figure 4: VRWKV [duan2024vision] module employed in VRWKV-Editor. The 3D-VRWKV architecture includes identical VRWKV encoder layers, an average pooling layer, and a linear prediction head. Token Shift denotes the quad-directional shift method tailored for vision tasks.
  • Figure 5: Qualitative comparison between CAMEL [Zhang_2024_CVPR], CCEdit [Feng_2024_CVPR], ControlVideo [zhang2023controlvideo], Vid2Vid-zero [wang2023zero], Swin-Editor [aitrouga2025swin], and our method. The first row presents the input video. The figure highlights two scenarios: object editing and object-background editing. The results showcase the superiority of our framework in editing videos across various scenarios, producing consistent and high-quality outputs.
  • ...and 2 more figures
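
The quad-directional Token Shift named in the Figure 4 caption can be pictured with a short sketch. It assumes the channels are split into four equal groups, each shifted by one pixel from a different neighbor (above, below, left, right) with zero padding at the borders, and then blended with the unshifted features through a mixing weight `mu`. The group ordering, the padding, the interpolation form, and the function names `quad_shift` and `token_shift_mix` are illustrative assumptions, not details taken from the paper, and the 3D variant used for video may shift along additional axes.

```python
import numpy as np

def quad_shift(x):
    """Shift four channel groups of an (H, W, C) map from four neighbors.

    Assumes C is divisible by 4; border positions with no neighbor are zero.
    """
    h, w, c = x.shape
    g = c // 4
    out = np.zeros_like(x)
    out[1:, :, 0 * g:1 * g] = x[:-1, :, 0 * g:1 * g]   # group 0: value from the row above
    out[:-1, :, 1 * g:2 * g] = x[1:, :, 1 * g:2 * g]   # group 1: value from the row below
    out[:, 1:, 2 * g:3 * g] = x[:, :-1, 2 * g:3 * g]   # group 2: value from the left column
    out[:, :-1, 3 * g:4 * g] = x[:, 1:, 3 * g:4 * g]   # group 3: value from the right column
    return out

def token_shift_mix(x, mu):
    """Blend original and shifted features; mu may be a scalar or a (C,) vector."""
    return mu * x + (1.0 - mu) * quad_shift(x)

# Hypothetical 3x3 feature map with 4 channels (one channel per direction).
x = np.arange(3 * 3 * 4, dtype=float).reshape(3, 3, 4)
mixed = token_shift_mix(x, mu=0.5)
print(mixed.shape)
```

The shift gives every token a one-step view of its spatial neighbors at negligible cost, which is one way a recurrence over a flattened token sequence can still see local 2D structure.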