LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning

Chenjian Gao, Lihe Ding, Xin Cai, Zhanpeng Huang, Zibin Wang, Tianfan Xue

TL;DR

This work addresses the rigidity of first-frame-guided video editing by introducing a mask-aware LoRA fine-tuning framework that leverages a spatiotemporal mask to selectively preserve or modify content and to learn motion and appearance separately. It operates on pretrained image-to-video diffusion models without changing their architecture, using two training regimes to learn motion from the source video and appearance guidance from reference frames, while disentangling edits from the background. A dual-role loss and carefully designed conditioning allow region-specific editing and robust propagation across frames, enabling complex transformations with temporal coherence. Empirical results on Wan2.1-I2V and HunyuanVideo-I2V demonstrate superior qualitative and quantitative performance over state-of-the-art baselines, along with a practical low-cost training strategy that reduces memory requirements.
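The claim that the method leaves the pretrained architecture unchanged follows from how LoRA works in general: a frozen weight matrix is augmented with a trainable low-rank product. The sketch below illustrates generic LoRA on a single linear layer in NumPy; it is not the authors' implementation, and the class name `LoRALinear`, the rank, the scale, and the initialization constants are all assumptions chosen for illustration.

```python
import numpy as np


class LoRALinear:
    """Hypothetical linear layer with a LoRA adapter (illustrative only)."""

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        out_features, in_features = weight.shape
        self.weight = weight  # pretrained weight, kept frozen
        # Low-rank factors: A starts small and random, B starts at zero,
        # so the adapted layer initially matches the pretrained layer.
        self.A = rng.normal(0.0, 0.01, size=(rank, in_features))
        self.B = np.zeros((out_features, rank))
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T                    # frozen path
        update = (x @ self.A.T) @ self.B.T          # trainable low-rank path
        return base + self.scale * update

    def num_trainable(self):
        # Only A and B would be updated during fine-tuning.
        return self.A.size + self.B.size


W = np.arange(6, dtype=np.float64).reshape(2, 3)
layer = LoRALinear(W, rank=2, alpha=2.0)
x = np.ones((1, 3))
y_initial = layer(x)
```

Because `B` starts at zero, `y_initial` equals the output of the frozen layer, and fine-tuning only moves the small factors `A` and `B`.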

Abstract

Diffusion-based video editing has achieved remarkable results in generating high-quality edits. However, current methods often rely on large-scale pretraining, limiting flexibility for specific edits. First-frame-guided editing provides control over the first frame, but lacks flexibility over subsequent frames. To address this, we propose a mask-based LoRA (Low-Rank Adaptation) tuning method that adapts pretrained Image-to-Video (I2V) models for flexible video editing. Our key innovation is using a spatiotemporal mask to strategically guide the LoRA fine-tuning process. This teaches the model two distinct skills. First, it learns to interpret the mask as a command to either preserve content from the source video or generate new content in designated regions. Second, for these generated regions, the LoRA learns to synthesize either temporally consistent motion inherited from the video or novel appearances guided by user-provided reference frames. This dual-capability LoRA grants users control over the edit's entire temporal evolution, allowing complex transformations such as an object rotating or a flower blooming. Experimental results show our method achieves superior video editing performance compared to baseline methods. Project Page: https://cjeen.github.io/LoRAEdit
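To make the idea of a spatiotemporal mask concrete, here is a minimal NumPy sketch of a per-frame, per-pixel mask and a loss restricted to one of its regions. It is an illustration under stated assumptions, not the paper's method: the convention (1 = preserve source content, 0 = generate new content), the rectangular edit region, and the helper names `make_spatiotemporal_mask` and `masked_mse` are all hypothetical, and the loss is a plain region-weighted mean squared error rather than the authors' actual training objective.

```python
import numpy as np


def make_spatiotemporal_mask(num_frames, height, width, edit_box,
                             keep_first_frame=True):
    """Build a (T, H, W) mask: 1 = preserve source, 0 = generate new content.

    edit_box is (top, left, bottom, right) in pixels; an assumed convention.
    """
    top, left, bottom, right = edit_box
    mask = np.ones((num_frames, height, width), dtype=np.float32)
    mask[:, top:bottom, left:right] = 0.0   # region to be regenerated
    if keep_first_frame:
        mask[0] = 1.0                       # first frame is given, not generated
    return mask


def masked_mse(pred, target, mask, region="generate"):
    """Mean squared error over one region of the mask only."""
    weight = (1.0 - mask) if region == "generate" else mask
    total = weight.sum()
    if total == 0:
        return 0.0
    return float((weight * (pred - target) ** 2).sum() / total)


mask = make_spatiotemporal_mask(4, 8, 8, edit_box=(2, 2, 6, 6))
target = mask.copy()           # toy target: 1 in preserved area, 0 in edit area
pred = np.zeros_like(target)   # toy prediction: all zeros

loss_generate = masked_mse(pred, target, mask, region="generate")
loss_preserve = masked_mse(pred, target, mask, region="preserve")
```

In this toy setup the prediction matches the target inside the edit region and misses it everywhere else, so the two region losses differ, which shows how a mask can separate supervision of edited and preserved content.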

Paper Structure

This paper contains 22 sections, 2 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Given a source video (top row), we achieve high-quality video editing guided by the first frame as a reference image (middle row), while maintaining flexibility for incorporating additional reference conditions (bottom row).
  • Figure 2: Exploring different mask configurations as an input condition to an image-to-video model. Left: Input conditions, which include a mask and a pseudo-video. Right: Video generation results under different mask configurations.
  • Figure 3: Our mask-guided LoRA pipeline. Training (Top): LoRA is fine-tuned to learn motion from the masked source video (left) and appearance from a reference frame (right). Inference (Bottom): The trained LoRA applies the learned motion and appearance to an edited first frame, producing a temporally consistent video.
  • Figure 4: Comparisons with state-of-the-art reference-guided video editing methods.
  • Figure 5: Comparisons with state-of-the-art first-frame-guided video editing methods.
  • ...and 4 more figures