
DreamMotion: Space-Time Self-Similar Score Distillation for Zero-Shot Video Editing

Hyeonho Jeong, Jinho Chang, Geon Yeong Park, Jong Chul Ye

TL;DR

This work uses score distillation sampling to bypass the standard reverse diffusion process, starting optimization from videos that already exhibit natural motion. In comparisons with leading methods, it alters appearance while accurately preserving the original structure and motion.

Abstract

Text-driven diffusion-based video editing presents a unique challenge not encountered in image editing literature: establishing real-world motion. Unlike existing video editing approaches, here we focus on score distillation sampling to circumvent the standard reverse diffusion process and initiate optimization from videos that already exhibit natural motion. Our analysis reveals that while video score distillation can effectively introduce new content indicated by target text, it can also cause significant structure and motion deviation. To counteract this, we propose to match space-time self-similarities of the original video and the edited video during the score distillation. Thanks to the use of score distillation, our approach is model-agnostic and can be applied to both cascaded and non-cascaded video diffusion frameworks. Through extensive comparisons with leading methods, our approach demonstrates its superiority in altering appearances while accurately preserving the original structure and motion.
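
To make the idea of matching space-time self-similarities more concrete, the following is a minimal sketch and not the authors' implementation. It assumes each video is given as per-frame lists of feature vectors, that self-similarity is measured with cosine similarity, that the two self-similarity matrices are compared with a mean squared difference, and that the temporal term uses spatially averaged features per frame. All function names (`cosine_self_similarity`, `self_similarity_loss`, `space_time_loss`) are hypothetical, and the paper's actual feature extractor and loss definitions may differ.

```python
import math


def cosine_self_similarity(features):
    """Return the N x N matrix of pairwise cosine similarities of N vectors."""
    # Guard against zero-length vectors by substituting a norm of 1.0.
    norms = [math.sqrt(sum(x * x for x in f)) or 1.0 for f in features]
    n = len(features)
    return [
        [
            sum(a * b for a, b in zip(features[i], features[j]))
            / (norms[i] * norms[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


def self_similarity_loss(original, edited):
    """Mean squared difference between two self-similarity matrices (assumed loss)."""
    s_orig = cosine_self_similarity(original)
    s_edit = cosine_self_similarity(edited)
    n = len(s_orig)
    total = sum(
        (s_orig[i][j] - s_edit[i][j]) ** 2 for i in range(n) for j in range(n)
    )
    return total / (n * n)


def frame_means(video):
    """Average the feature vectors of each frame over its spatial locations."""
    return [[sum(col) / len(frame) for col in zip(*frame)] for frame in video]


def space_time_loss(original_video, edited_video):
    """video[t][p] is the feature vector at frame t and spatial location p."""
    # Spatial term: within each frame, compare similarities among locations.
    spatial = sum(
        self_similarity_loss(fo, fe)
        for fo, fe in zip(original_video, edited_video)
    ) / len(original_video)
    # Temporal term (assumption): compare similarities among per-frame means.
    temporal = self_similarity_loss(
        frame_means(original_video), frame_means(edited_video)
    )
    return spatial + temporal


# Toy data: 2 frames, 2 spatial locations, 2-dimensional features.
original = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, 0.0]]]
# Uniformly rescaled features keep every pairwise cosine similarity.
rescaled = [[[2.0 * x for x in vec] for vec in frame] for frame in original]
# Collapsing two locations onto the same feature changes the structure.
collapsed = [[[1.0, 0.0], [1.0, 0.0]], [[1.0, 1.0], [1.0, 0.0]]]

print(space_time_loss(original, rescaled))
print(space_time_loss(original, collapsed))
```

In this toy setting, the rescaled video incurs essentially no penalty because its relational structure is unchanged, while the collapsed video is penalized. In an actual editing loop, a term of this kind would be added to the score distillation objective as a regularizer.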

Paper Structure (35 sections, 15 equations, 20 figures, 4 tables)

This paper contains 35 sections, 15 equations, 20 figures, 4 tables.

Figures (20)

  • Figure 1: Zero-shot video editing results. The second row presents videos produced by our method with a non-cascaded video diffusion model, while those in the bottom row are from a cascaded model. For a full display of results, visit our project page: https://hyeonho99.github.io/dreammotion.
  • Figure 2: Ancestral sampling-based zero-shot video editing fails to capture complex, real-world motion in the generated videos.
  • Figure 3: Optimization progress visualization. The proposed self-similarity regularization effectively preserves the structure and motion of the original video.
  • Figure 4: Overview. DreamMotion leverages gradients derived from score distillation to inject target appearance, which is complemented by self-similarity alignments across spatial and temporal dimensions. This strategy seamlessly fits into cascaded video diffusion frameworks, confining the optimization on the keyframe generation phase.
  • Figure 5: The proposed space-time self-similarity regularization: (a) Spatial Self-Similarity Matching and (b) Temporal Self-Similarity Matching
  • ...and 15 more figures