Video4Edit: Viewing Image Editing as a Degenerate Temporal Process
Xiaofan Li, Yanpeng Sun, Chenming Wu, Fan Duan, YuAn Wang, Weihao Bo, Yumeng Zhang, Dingkang Liang
TL;DR
Video4Edit recasts instruction-driven image editing as a degenerate temporal process so that priors learned from video pre-training can be reused for data-efficient editing. It uses a teacher–student framework: a frozen, video-pretrained teacher generates temporally coherent evolution trajectories, while a trainable student learns to produce edits from a source image and a concise instruction. Training combines block-wise DiT distillation, which transfers the temporal priors, with tail-frame supervision in a 3D-VAE latent space, which anchors the final state; together they enable strong performance with roughly $1\%$ of conventional supervision. Experiments on GEdit-Bench-EN and ImgEdit-Bench show competitive results and robust generalization across editing tasks, with better preservation of non-edited regions and more localized edits. The approach offers a practical, scalable path toward flexible, data-efficient, instruction-driven editing that can extend to video and interactive workflows.
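To make the two training signals concrete, the following is a minimal sketch of how a block-wise distillation term and a tail-frame term might be combined into one objective. It is an illustration only, not the authors' formulation: the function names, the use of mean squared error for both terms, the array shapes, and the weight `lam` are all assumptions introduced here.

```python
import numpy as np


def blockwise_distill_loss(student_feats, teacher_feats):
    """Average, over blocks, of the MSE between student and frozen-teacher features.

    Both arguments are lists with one feature array per transformer block
    (hypothetical layout; the per-block MSE is an assumption).
    """
    if len(student_feats) != len(teacher_feats):
        raise ValueError("student and teacher must expose the same number of blocks")
    per_block = [np.mean((s - t) ** 2) for s, t in zip(student_feats, teacher_feats)]
    return float(np.mean(per_block))


def tail_frame_loss(student_latents, target_latent):
    """MSE between the last latent frame of the student trajectory and the target.

    student_latents: array of shape (T, C, H, W), a latent trajectory.
    target_latent:   array of shape (C, H, W), the latent of the edited image.
    """
    tail = student_latents[-1]  # only the final frame is supervised
    return float(np.mean((tail - target_latent) ** 2))


def total_loss(student_feats, teacher_feats, student_latents, target_latent, lam=1.0):
    """Distillation term plus lam times the tail-frame term (lam is hypothetical)."""
    distill = blockwise_distill_loss(student_feats, teacher_feats)
    tail = tail_frame_loss(student_latents, target_latent)
    return distill + lam * tail
```

In this sketch the distillation term touches every block and every frame of the trajectory, while the tail-frame term constrains only the final latent frame, mirroring the division of roles described above: the first carries the temporal prior, the second pins down the edited result.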
Abstract
Recent advances in multimodal foundation models have propelled instruction-driven image generation and editing into a genuinely cross-modal, cooperative regime. Nevertheless, state-of-the-art editing pipelines remain costly: beyond training large diffusion/flow models, they require curating massive sets of high-quality \{instruction, source image, edited image\} triplets to cover diverse user intents. Moreover, the fidelity of visual replacements hinges on how precisely the instruction references the target semantics. We revisit this challenge through the lens of temporal modeling: if video is a full temporal process, then image editing can be seen as a degenerate one. This perspective lets us transfer single-frame evolution priors from video pre-training, enabling a highly data-efficient fine-tuning regime. Empirically, our approach matches the performance of leading open-source baselines while using only about $1\%$ of the supervision demanded by mainstream editing models.
