RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models

Ozgur Kara; Bariscan Kurtkaya; Hidir Yesiltepe; James M. Rehg; Pinar Yanardag

TL;DR

RAVE tackles zero-shot video editing by leveraging pre-trained text-to-image diffusion models without additional training. It introduces a grid-based editing scheme and a novel noise shuffling strategy to enforce strong spatio-temporal interactions, enabling fast and temporally consistent edits for longer videos. The approach is conditioned with ControlNet and uses DDIM inversion to bridge image edits to video, achieving superior temporal coherence and textual alignment on a diverse 186-video dataset. The authors provide extensive qualitative, quantitative, and user-study evidence, show favorable runtime, and release code and data to facilitate reproducibility.

Abstract

Recent advancements in diffusion-based models have demonstrated significant success in generating images from text. However, video editing models have not yet reached the same level of visual quality and user control. To address this, we introduce RAVE, a zero-shot video editing method that leverages pre-trained text-to-image diffusion models without additional training. RAVE takes an input video and a text prompt to produce high-quality videos while preserving the original motion and semantic structure. It employs a novel noise shuffling strategy, leveraging spatio-temporal interactions between frames, to produce temporally consistent videos faster than existing methods. It is also efficient in terms of memory requirements, allowing it to handle longer videos. RAVE is capable of a wide range of edits, from local attribute modifications to shape transformations. To demonstrate the versatility of RAVE, we create a comprehensive video evaluation dataset ranging from object-focused scenes to complex human activities like dancing and typing, and dynamic scenes featuring swimming fish and boats. Our qualitative and quantitative experiments highlight the effectiveness of RAVE in diverse video editing scenarios compared to existing methods. Our code, dataset, and videos can be found at https://rave-video.github.io.

Paper Structure (42 sections, 2 equations, 13 figures, 1 table)

This paper contains 42 sections, 2 equations, 13 figures, 1 table.

Introduction
Related Works
Text-driven image editing
Text-driven video editing with training
Text-driven video editing without training
Methodology
Preliminaries
Latent diffusion models (LDMs)
Denoising diffusion implicit models (DDIM)
ControlNet
Our approach
Grid trick
Grid trick for video editing
Noise shuffling
Dataset
...and 27 more sections

Figures (13)

Figure 1: RAVE is a lightweight and fast video editing method that enhances temporal consistency in video edits, utilizing pre-trained text-to-image diffusion models. It is capable of modifying local attributes, like changing a person's jacket (bottom right), and can also handle complex shape transformations, such as turning a wolf into a dinosaur (bottom left).
Figure 2: Comparison with existing attention modules. The first row shows the frames of the input video, followed by subsequent rows presenting the edited frames under various attention settings, with our approach in the last row. The column on the right provides zoomed-in crops of mountains (from the 1st and 3rd frames) and the car's bumper (from the 2nd and 4th frames).
Figure 3: An illustration of RAVE. Our process begins by performing a DDIM inversion with the pre-trained T2I model and condition extraction with an off-the-shelf condition preprocessor applied to the input video ($\mathcal{V}_K$). These conditions are subsequently input into ControlNet. In the RAVE video editing process, diffusion denoising is performed for T timesteps using condition grids ($\mathcal{C}_L$), latent grids ($\mathcal{G}_L^t$), and the target text prompt as input for ControlNet. Random shuffling is applied to the latent grids ($\mathcal{G}_L^t$) and condition grids ($\mathcal{C}_L$) at each denoising step. After T timesteps, the latent grids are rearranged, and the final output video ($\mathcal{V}_K^*$) is obtained.
Figure 4: Consistency across grids. Editing results are shown for (a) processing grids independently, (b) adapting sparse-causal attention using grids, and (c) applying RAVE. The rightmost column features a close-up of the car's front, highlighting temporal color changes per approach. RAVE produces consistent patches in all grids while other methods struggle with consistency.
Figure 5: Types of edits in our dataset.
...and 8 more figures
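
The shuffling loop described in the Figure 3 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: per-frame latents and conditions are plain Python lists rather than tensors tiled into image grids, and `denoise_grid` is a hypothetical placeholder for one denoising step of the text-to-image model with ControlNet. The function names `make_grids` and `shuffled_denoise` are also made up for this sketch. What it does show is the bookkeeping: the same random permutation is applied to latents and conditions at every step, frames are regrouped into grids, and the original frame order is restored at the end.

```python
import random


def make_grids(items, per_grid):
    """Split a flat per-frame list into consecutive groups, one group per grid."""
    return [items[i:i + per_grid] for i in range(0, len(items), per_grid)]


def shuffled_denoise(latents, conditions, steps, per_grid, denoise_grid, seed=0):
    """Run `steps` denoising steps, redrawing the frame-to-grid assignment each step.

    latents, conditions: one entry per frame; length must be divisible by per_grid.
    denoise_grid: hypothetical callable (latent_grid, condition_grid, t) -> list of
        updated latents, standing in for the diffusion model plus ControlNet.
    """
    if len(latents) != len(conditions):
        raise ValueError("latents and conditions must have the same length")
    if len(latents) % per_grid != 0:
        raise ValueError("number of frames must be divisible by per_grid")

    rng = random.Random(seed)
    # order[slot] is the original frame index currently stored at that slot.
    order = list(range(len(latents)))

    for t in reversed(range(steps)):
        # Draw one permutation and apply it to latents, conditions, and the
        # bookkeeping list so that every frame stays paired with its condition.
        perm = list(range(len(latents)))
        rng.shuffle(perm)
        latents = [latents[p] for p in perm]
        conditions = [conditions[p] for p in perm]
        order = [order[p] for p in perm]

        # Denoise each grid of frames jointly.
        updated = []
        for lat_grid, cond_grid in zip(make_grids(latents, per_grid),
                                       make_grids(conditions, per_grid)):
            updated.extend(denoise_grid(lat_grid, cond_grid, t))
        latents = updated

    # Rearrange the latents back into the original frame order.
    restored = [None] * len(latents)
    for slot, frame_index in enumerate(order):
        restored[frame_index] = latents[slot]
    return restored
```

Because grid membership changes at every step, each frame is denoised alongside a different set of frames over time, which is the mechanism the caption credits for consistency across grids.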
