LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning
Chenjian Gao, Lihe Ding, Xin Cai, Zhanpeng Huang, Zibin Wang, Tianfan Xue
TL;DR
This work addresses the rigidity of first-frame-guided video editing by introducing a mask-aware LoRA fine-tuning framework that leverages a spatiotemporal mask to selectively preserve or modify content and to learn motion and appearance separately. It operates on pretrained image-to-video diffusion models without changing their architecture, using two training regimes to learn motion from the source video and appearance guidance from reference frames, while disentangling edits from the background. A dual-role loss and carefully designed conditioning allow region-specific editing and robust propagation across frames, enabling complex transformations with temporal coherence. Empirical results on Wan2.1-I2V and HunyuanVideo-I2V demonstrate superior qualitative and quantitative performance over state-of-the-art baselines, along with a practical low-cost training strategy that reduces memory requirements.
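As background for the claim that the method leaves the pretrained architecture unchanged, the following is a minimal NumPy sketch of a generic LoRA layer: the pretrained weight stays frozen and only a low-rank update is trained. The class name `LoRALinear` and the `rank`, `alpha`, and `seed` parameters are illustrative assumptions, not taken from the paper's code, and the sketch does not reflect which layers of Wan2.1-I2V or HunyuanVideo-I2V are actually adapted.

```python
import numpy as np


class LoRALinear:
    """Linear layer with a frozen weight W and a trainable low-rank update.

    Effective weight: W + (alpha / rank) * B @ A.
    Names and defaults are illustrative, not the paper's implementation.
    """

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                      # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))  # trainable
        self.B = np.zeros((out_dim, rank))        # trainable, zero-initialized
        self.scale = alpha / rank

    def __call__(self, x):
        # Base path plus low-rank path; x has shape (batch, in_dim).
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def merged_weight(self):
        # The update can be folded into W, so the architecture is unchanged.
        return self.weight + self.scale * self.B @ self.A
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained layer exactly; fine-tuning only moves `A` and `B`.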
Abstract
Diffusion-based video editing has achieved remarkable results in producing high-quality edits. However, current methods often rely on large-scale pretraining, which limits their flexibility for specific edits. First-frame-guided editing provides control over the first frame but offers little control over subsequent frames. To address this, we propose a mask-based LoRA (Low-Rank Adaptation) tuning method that adapts pretrained Image-to-Video (I2V) models for flexible video editing. Our key innovation is using a spatiotemporal mask to strategically guide the LoRA fine-tuning process. This teaches the model two distinct skills. First, it learns to interpret the mask as a command either to preserve content from the source video or to generate new content in designated regions. Second, for these generated regions, the LoRA learns to synthesize either temporally consistent motion inherited from the source video or novel appearances guided by user-provided reference frames. This dual-capability LoRA gives users control over the edit's entire temporal evolution, enabling complex transformations such as an object rotating or a flower blooming. Experimental results show that our method achieves superior video editing performance compared to baseline methods. Project Page: https://cjeen.github.io/LoRAEdit
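To make the role of the spatiotemporal mask concrete, here is a minimal NumPy sketch of how such a mask could mark regions to preserve versus regions to generate, and how a masked conditioning video could be derived from it. The convention (1 = preserve, 0 = generate), the rectangular edit region, the zero fill for generated regions, and the function names `build_mask` and `masked_condition` are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np


def build_mask(num_frames, height, width, edit_box, keep_first_frame=True):
    """Return a (T, H, W) mask: 1 = preserve source content, 0 = generate.

    edit_box is (top, bottom, left, right) in pixels; a hypothetical
    rectangular edit region used only for this illustration.
    """
    top, bottom, left, right = edit_box
    mask = np.ones((num_frames, height, width), dtype=np.float32)
    mask[:, top:bottom, left:right] = 0.0   # region the model should fill in
    if keep_first_frame:
        mask[0] = 1.0                       # first frame is given in full
    return mask


def masked_condition(video, mask):
    """Zero out the regions to be generated in a (T, H, W, C) video."""
    return video * mask[..., None]
```

For example, with a 4-frame 8x8 video and `edit_box=(2, 6, 2, 6)`, the first frame is kept whole while the central 4x4 patch of every later frame is blanked for the model to synthesize.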
