I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models

Wenqi Ouyang, Yi Dong, Lei Yang, Jianlou Si, Xingang Pan

TL;DR

To close the gap between image and video editing, the paper introduces I2VEdit, which propagates an edit made to the first frame across the whole video using a pre-trained image-to-video diffusion model. It separates motion preservation from content editing through two pipelines: Coarse Motion Extraction, which uses Motion LoRAs and skip-interval cross-attention, and Appearance Refinement, which uses EDM inversion and fine-grained attention matching, augmented by SARP. The method produces high-quality, temporally consistent results for both fine-grained local edits and global style transfers, guided by a single edited frame. The authors report improvements over prior image-guided and text-guided video editing approaches, which makes existing image editing tools usable for video edits.

Abstract

The remarkable generative capabilities of diffusion models have motivated extensive research in both image and video editing. Compared to video editing, which faces additional challenges in the time dimension, image editing has witnessed the development of more diverse, high-quality approaches and more capable software like Photoshop. In light of this gap, we introduce a novel and generic solution that extends the applicability of image editing tools to videos by propagating edits from a single frame to the entire video using a pre-trained image-to-video model. Our method, dubbed I2VEdit, adaptively preserves the visual and motion integrity of the source video depending on the extent of the edits, effectively handling global edits, local edits, and moderate shape changes, which existing methods cannot fully achieve. At the core of our method are two main processes: Coarse Motion Extraction, which aligns basic motion patterns with the original video, and Appearance Refinement, which makes precise adjustments using fine-grained attention matching. We also incorporate a skip-interval strategy to mitigate quality degradation from auto-regressive generation across multiple video clips. Experimental results demonstrate our framework's superior performance in fine-grained video editing and its capability to produce high-quality, temporally consistent outputs.
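The abstract mentions auto-regressive generation across multiple video clips. As a rough illustration only, the sketch below splits a long video into fixed-length clips that share one boundary frame, so that the last edited frame of one clip can serve as the first-frame condition for the next. The one-frame overlap, the function name `split_into_clips`, and the `clip_len` parameter are assumptions made for this example; they are not taken from the paper's implementation.

```python
from typing import List


def split_into_clips(num_frames: int, clip_len: int) -> List[range]:
    """Split frame indices 0..num_frames-1 into consecutive clips.

    Assumption (not from the paper): neighbouring clips overlap by exactly
    one frame, so the final frame of clip k is the first frame of clip k+1.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if clip_len < 2:
        raise ValueError("clip_len must be at least 2 to allow a one-frame overlap")

    clips: List[range] = []
    start = 0
    while True:
        end = min(start + clip_len, num_frames)  # exclusive end index
        clips.append(range(start, end))
        if end >= num_frames:
            break
        start = end - 1  # reuse the last frame as the next clip's first frame
    return clips


if __name__ == "__main__":
    for clip in split_into_clips(num_frames=10, clip_len=4):
        print(list(clip))
```

Because every clip after the first starts from a generated frame, errors can accumulate from clip to clip; this is the kind of degradation the skip-interval strategy is described as mitigating.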

Paper Structure (24 sections, 8 equations, 22 figures, 2 tables)

Figures (22)

  • Figure 1: Our video editing pipeline. Given a first frame that the user has edited with an image editing tool (e.g., EditAnything; Gao et al., 2023), our model generates a video consistent with that frame while adaptively preserving the appearance and motion of the source video.
  • Figure 2: Our framework comprises two pipelines: the Coarse Motion Extraction Pipeline (Training Stage) and the Appearance Refinement Pipeline (Inference Stage). The Coarse Motion Extraction Pipeline extracts coarse motion by learning skip-interval motion LoRAs for each clip. In the inference stage, the Appearance Refinement Pipeline further refines motion and appearance consistency through fine-grained matching between the attention maps obtained during EDM (Karras et al., 2022) inversion and those obtained during denoising.
  • Figure 3: Fine-Grained Attention Matching.
  • Figure 4: Qualitative comparison with image-guided video editing (shown in purple), text-guided video editing, and motion customization methods. We use EditAnything (Gao et al., 2023) to generate first-frame editing results for all image-guided video editing methods. "*" means the method uses an additional editing mask.
  • Figure 5: Comparison of ablation settings of our method, using the same keyframe generated by AnyDoor (Chen et al., 2023).
  • ...and 17 more figures