Video Editing via Factorized Diffusion Distillation

Uriel Singer; Amit Zohar; Yuval Kirstain; Shelly Sheynin; Adam Polyak; Devi Parikh; Yaniv Taigman

TL;DR

This work introduces Emu Video Edit (EVE), a model that establishes a new state of the art in video editing without relying on any supervised video editing data. It also introduces Factorized Diffusion Distillation, a new unsupervised distillation procedure.

Abstract

We introduce Emu Video Edit (EVE), a model that establishes a new state of the art in video editing without relying on any supervised video editing data. To develop EVE, we separately train an image editing adapter and a video generation adapter, and attach both to the same text-to-image model. Then, to align the adapters towards video editing, we introduce a new unsupervised distillation procedure, Factorized Diffusion Distillation. This procedure distills knowledge from one or more teachers simultaneously, without any supervised data. We use this procedure to teach EVE to edit videos by jointly distilling knowledge to (i) precisely edit each individual frame, from the image editing adapter, and (ii) ensure temporal consistency among the edited frames, from the video generation adapter. Finally, to demonstrate the potential of our approach in unlocking other capabilities, we align additional combinations of adapters.
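As a rough illustration of what "distilling from several teachers simultaneously" can look like, the toy sketch below combines a score-distillation (SDS) style signal and an adversarial term from two frozen teachers into one student update signal. It is not the authors' implementation: the function names, the per-teacher weights, the simplified adversarial loss, and the use of plain Python lists in place of video tensors are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def sds_gradient(teacher_noise_pred: Sequence[float],
                 injected_noise: Sequence[float],
                 weight: float = 1.0) -> Vector:
    """SDS-style signal: the frozen teacher's noise prediction minus the
    noise that was actually added to the student's sample (hypothetical helper)."""
    return [weight * (p - n) for p, n in zip(teacher_noise_pred, injected_noise)]


def generator_adversarial_loss(disc_scores: Sequence[float]) -> float:
    """Simplified generator loss (an assumption): the student is rewarded
    when the discriminator scores its samples as teacher-like (higher score)."""
    return -sum(disc_scores) / len(disc_scores)


def combine_teacher_signals(sds_grads: Dict[str, Vector],
                            adv_losses: Dict[str, float],
                            weights: Dict[str, float]) -> Tuple[Vector, float]:
    """Weighted sum of the per-teacher SDS signals and adversarial losses."""
    size = len(next(iter(sds_grads.values())))
    total_grad = [0.0] * size
    for name, grad in sds_grads.items():
        for i, value in enumerate(grad):
            total_grad[i] += weights[name] * value
    total_adv = sum(weights[name] * loss for name, loss in adv_losses.items())
    return total_grad, total_adv


# Toy values standing in for a noised student sample and two teacher outputs.
noise = [0.5, -1.0, 0.25]
edit_teacher_pred = [0.75, -1.0, 0.0]    # image editing teacher (per frame)
video_teacher_pred = [0.5, -0.5, 0.25]   # video generation teacher

sds = {
    "edit": sds_gradient(edit_teacher_pred, noise),
    "video": sds_gradient(video_teacher_pred, noise),
}
adv = {
    "edit": generator_adversarial_loss([0.0, 2.0]),
    "video": generator_adversarial_loss([1.0, 1.0]),
}
teacher_weights = {"edit": 1.0, "video": 0.5}  # hypothetical weighting

total_grad, total_adv = combine_teacher_signals(sds, adv, teacher_weights)
print(total_grad, total_adv)
```

In a real system each teacher would be the shared backbone with one adapter attached, the SDS signal would be applied to the student's generated video through a stop-gradient, and the relative weighting of the teachers would be a tuned hyperparameter.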

Paper Structure (23 sections, 6 equations, 5 figures, 2 tables)

Introduction
Related Work
Method
Architecture
Video Generation Adapter
Image Editing Adapter
Combining The Adapters
Final Architecture
Factorized Diffusion Distillation
Implementation Details
K-Bin Diffusion Sampling
Discriminator Architecture
Experiments
Metrics
Unsupervised Dataset
...and 8 more sections

Figures (5)

Figure 1: EVE is a text-guided video editing model that enables various editing tasks.
Figure 2: Model architecture and alignment procedure. We train an adapter for image editing (in blue) and an adapter for video generation (in orange) on top of a shared text-to-image backbone. Then, we create a student network by stacking both adapters together on the shared backbone (in green) and align the two adapters. The student is trained using (i) score distillation from each frozen teacher adapter (marked as SDS) and (ii) an adversarial loss for each teacher (in pink). SDS is calculated on samples generated by the student from noise, and the discriminators attempt to differentiate between samples generated by the teachers and samples generated by the student.
Figure 3: Comparison of our model against baselines using examples from the Text-Guided Video Editing (TGVE) benchmark (Wu et al., 2023) and our extension of it.
Figure 4: Our model performs zero-shot video editing for tasks that Emu Edit can execute on images, without explicitly training on them during alignment.
Figure 5: We apply FDD to combine an editing adapter with LoRA-based adapters: (1) a subject-driven adapter for a dog, (2) a subject-driven adapter for a toy robot, (3) a style-driven adapter for line art, and (4) a style-driven adapter for stickers.
