DreamMotion: Space-Time Self-Similar Score Distillation for Zero-Shot Video Editing

Hyeonho Jeong; Jinho Chang; Geon Yeong Park; Jong Chul Ye

TL;DR

This work uses score distillation sampling to bypass the standard reverse diffusion process, starting optimization from videos that already exhibit natural motion. In comparisons with leading methods, the approach is better at altering appearance while accurately preserving the original structure and motion.

Abstract

Text-driven diffusion-based video editing presents a unique challenge not encountered in the image editing literature: establishing real-world motion. Unlike existing video editing approaches, we focus on score distillation sampling to circumvent the standard reverse diffusion process and initiate optimization from videos that already exhibit natural motion. Our analysis reveals that while video score distillation can effectively introduce new content indicated by the target text, it can also cause significant deviations in structure and motion. To counteract this, we propose to match the space-time self-similarities of the original and edited videos during score distillation. Thanks to the use of score distillation, our approach is model-agnostic and can be applied to both cascaded and non-cascaded video diffusion frameworks. Through extensive comparisons with leading methods, our approach demonstrates its superiority in altering appearances while accurately preserving the original structure and motion.
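The score-distillation step described in the abstract can be sketched roughly as follows. This is a minimal NumPy sketch, not the authors' implementation: it assumes the video pixels are the variable being optimized, uses the standard SDS gradient w(t)(ε̂ − ε), which skips backpropagation through the denoiser, and replaces the text-conditioned video diffusion model with a hypothetical `eps_pred_fn` callable. The masked gradients and the self-similarity terms are left out.

```python
import numpy as np


def sds_gradient(video, eps_pred_fn, alpha_bar, weight, rng):
    """Score-distillation gradient taken directly with respect to the video pixels.

    video:       (frames, height, width, channels) array being optimized
    eps_pred_fn: hypothetical stand-in for a text-conditioned noise predictor,
                 called as eps_pred_fn(noisy_video, alpha_bar)
    alpha_bar:   cumulative noise-schedule value for the sampled timestep, in (0, 1)
    weight:      timestep weighting w(t)
    """
    noise = rng.standard_normal(video.shape)
    # Forward-diffuse the current video to the sampled noise level.
    noisy = np.sqrt(alpha_bar) * video + np.sqrt(1.0 - alpha_bar) * noise
    # SDS does not backpropagate through the denoiser: the gradient is the
    # weighted residual between predicted and injected noise.
    return weight * (eps_pred_fn(noisy, alpha_bar) - noise)


def sds_step(video, eps_pred_fn, alpha_bar, weight, lr, rng):
    """One plain gradient-descent update (the optimizer choice is an assumption)."""
    return video - lr * sds_gradient(video, eps_pred_fn, alpha_bar, weight, rng)


# Toy usage: a predictor whose implied clean estimate is a fixed `target`
# makes the gradient proportional to (video - target), so each step pulls
# the video toward that target regardless of the sampled noise.
rng = np.random.default_rng(0)
video = rng.standard_normal((4, 8, 8, 3))
target = np.zeros_like(video)
toy_predictor = lambda noisy, ab: (noisy - np.sqrt(ab) * target) / np.sqrt(1.0 - ab)
updated = sds_step(video, toy_predictor, alpha_bar=0.5, weight=1.0, lr=0.1, rng=rng)
```

With a real text-conditioned model, the same update would instead pull the video toward content that matches the prompt. Because optimization starts from the source video rather than from pure noise, the existing motion is a natural starting point, which is the motivation stated above.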

Paper Structure (35 sections, 15 equations, 20 figures, 4 tables)

Introduction
Background
Diffusion Models
Conditional Generation
Video Diffusion Models
DreamMotion
Overview
Appearance Injection
Image Score Distillation
Video Score Distillation with Masked Gradients
Structure Correction
Spatial Self-Similarity Matching
Temporal Smoothing
Temporal Self-Similarity Matching
Expansion to Cascade Video Diffusion
...and 20 more sections

Figures (20)

Figure 1: Zero-shot video editing results. The second row presents videos produced by our method with a non-cascaded video diffusion model, while those in the bottom row are from a cascaded model. For a full display of results, visit our project page: https://hyeonho99.github.io/dreammotion.
Figure 2: Ancestral sampling-based zero-shot video editing fails to capture complex, real-world motion in the generated videos.
Figure 3: Optimization progress visualization. The proposed self-similarity regularization effectively preserves the structure and motion of the original video.
Figure 4: Overview. DreamMotion leverages gradients derived from score distillation to inject the target appearance, complemented by self-similarity alignments across the spatial and temporal dimensions. This strategy fits seamlessly into cascaded video diffusion frameworks, confining the optimization to the keyframe generation phase.
Figure 5: The proposed space-time self-similarity regularization: (a) Spatial Self-Similarity Matching and (b) Temporal Self-Similarity Matching
...and 15 more figures
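The space-time self-similarity regularization of Figure 5 can be sketched roughly as follows. This is a minimal NumPy sketch under assumptions not taken from the paper: features arrive as (frames, tokens, dim) arrays from some unspecified feature extractor, similarity is cosine similarity, and the mismatch between the original and edited videos is a mean squared error. Spatial matrices compare locations within each frame (panel a); temporal matrices compare frames at each location (panel b).

```python
import numpy as np


def _cosine_self_similarity(tokens):
    """Pairwise cosine similarity over axis -2 of an array shaped (..., n, dim)."""
    normed = tokens / (np.linalg.norm(tokens, axis=-1, keepdims=True) + 1e-8)
    return normed @ np.swapaxes(normed, -1, -2)


def spatial_self_similarity(features):
    """(frames, tokens, dim) -> (frames, tokens, tokens): one matrix per frame."""
    return _cosine_self_similarity(features)


def temporal_self_similarity(features):
    """(frames, tokens, dim) -> (tokens, frames, frames): one matrix per location."""
    return _cosine_self_similarity(np.swapaxes(features, 0, 1))


def self_similarity_loss(orig_features, edit_features):
    """Return (spatial, temporal) mean squared mismatches between the two videos."""
    spatial = np.mean(
        (spatial_self_similarity(orig_features) - spatial_self_similarity(edit_features)) ** 2
    )
    temporal = np.mean(
        (temporal_self_similarity(orig_features) - temporal_self_similarity(edit_features)) ** 2
    )
    return spatial, temporal
```

Because cosine similarity ignores feature magnitude, these terms constrain the relational layout (which regions resemble which, and how each location evolves across frames) rather than the raw features, which is one way to read why matching self-similarities lets appearance change while structure and motion stay close to the original.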
