Diffusion Model-Based Video Editing: A Survey

Wenhao Sun, Rong-Cheng Tu, Jingyi Liao, Dacheng Tao

TL;DR

This survey systematically catalogs diffusion-model-based video editing techniques, organizing them by core technologies such as network paradigms, attention injection, latent manipulation, canonical representations, and novel conditioning. It introduces V2VBench to benchmark 150 edits across 4 tasks with 10 metrics, enabling fair cross-method comparisons of 16 prominent approaches. The work highlights the trade-offs between editing fidelity, temporal coherence, and efficiency, and it discusses practical challenges including data scarcity, computational demands, and evaluation standardization. By mapping evolutionary trajectories and identifying open problems, the paper provides a roadmap for future work on diffusion-based video editing.

Abstract

The rapid development of diffusion models (DMs) has significantly advanced image and video applications, making "what you want is what you see" a reality. Among these, video editing has gained substantial attention and seen a swift rise in research activity, necessitating a comprehensive and systematic review of the existing literature. This paper reviews diffusion model-based video editing techniques, including theoretical foundations and practical applications. We begin with an overview of the mathematical formulation and the key methods of the image domain. Subsequently, we categorize video editing approaches by the inherent connections of their core technologies, depicting their evolutionary trajectory. This paper also covers novel applications, including point-based editing and pose-guided human video editing. Additionally, we present a comprehensive comparison using our newly introduced V2VBench. Building on the progress achieved to date, the paper concludes with ongoing challenges and potential directions for future research.

Paper Structure

This paper contains 42 sections, 48 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: This survey focuses on video editing and video-to-video translation tasks. Results depicted here are sourced from [ymc+23, mos+23, CeylanHM23, ryj+23, yyb+23, zjj+23] and will be discussed in \ref{sec:desc-based}.
  • Figure 2: The architecture of LDM [rad+22]. It consists of an encoder $\mathcal{E}$, a decoder $\mathcal{D}$, and a UNet noise predictor $\epsilon_\theta$ with cross-attention blocks to incorporate conditioning from domain encoders $\tau_\theta$. Figure adapted from LDM [rad+22].
  • Figure 3: Feature injection methods for image editing. P2P [hmt+22] injects cross-attention maps from the reconstruction branch into the editing branch. PnP [tgb+23] injects self-attention maps, while MasaCtrl [cwq+23] injects self-attention key and value features.
  • Figure 4: The temporal extension from image to video models involves incorporating 1D temporal convolutions after each 2D spatial convolution (left). Additionally, each 2D spatial attention block is followed by a 1D temporal attention block (right). Figure adapted from Gen-1 [pjp+23].
  • Figure 5: Self-attention variations. The query tokens are in red, and the key and value tokens are in blue. $W$, $H$, and $F$ denote the input video's width, height, and number of frames.
  • ...and 7 more figures
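
The captions of Figures 4 and 5 describe how attention is regrouped when an image model is extended to video. The following is a minimal NumPy sketch of that token regrouping, not the implementation of any surveyed method. It assumes a feature tensor of shape `(F, H, W, C)`, omits the learned query, key, and value projections (identity projections are used), and the function names are hypothetical. The reference-frame variant is one common illustrative choice and may not correspond to a specific panel of Figure 5.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention; tokens live on the second-to-last axis."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def spatial_self_attention(x):
    """Each frame attends only to its own H*W tokens (2D spatial attention)."""
    F, H, W, C = x.shape
    tokens = x.reshape(F, H * W, C)  # batch = frames, tokens = pixels
    return attention(tokens, tokens, tokens).reshape(F, H, W, C)


def temporal_self_attention(x):
    """Each spatial location attends across the F frames (1D temporal attention)."""
    F, H, W, C = x.shape
    tokens = x.reshape(F, H * W, C).transpose(1, 0, 2)  # batch = pixels, tokens = frames
    out = attention(tokens, tokens, tokens)
    return out.transpose(1, 0, 2).reshape(F, H, W, C)


def reference_frame_attention(x, ref=0):
    """Queries come from every frame; keys and values come from one reference frame.

    Assumption: this variant is included for illustration only.
    """
    F, H, W, C = x.shape
    tokens = x.reshape(F, H * W, C)
    kv = np.broadcast_to(tokens[ref], tokens.shape)  # share the reference tokens
    return attention(tokens, kv, kv).reshape(F, H, W, C)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(4, 8, 8, 16))  # F=4, H=W=8, C=16
    for fn in (spatial_self_attention, temporal_self_attention, reference_frame_attention):
        print(fn.__name__, fn(video).shape)
```

The only difference between the three functions is which axis is treated as the token axis and where the keys and values are taken from, which is the distinction the red (query) and blue (key and value) tokens in Figure 5 are meant to convey.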