Video4Edit: Viewing Image Editing as a Degenerate Temporal Process

Xiaofan Li, Yanpeng Sun, Chenming Wu, Fan Duan, YuAn Wang, Weihao Bo, Yumeng Zhang, Dingkang Liang

TL;DR

Video4Edit recasts instruction-driven image editing as a degenerate temporal process to leverage video priors for data-efficient editing. It employs a teacher–student framework in which a frozen video-pretrained teacher generates temporally coherent evolution trajectories, while a trainable student learns to produce edits from a source image and a concise instruction. Training combines block-wise DiT distillation, which transfers temporal priors, with tail-frame supervision in a 3D‑VAE latent space, which anchors the final state. Together these yield performance on par with leading open-source baselines using roughly $1\%$ of the supervision required by mainstream editing models. Experiments on GEdit-Bench-EN and ImgEdit-Bench demonstrate competitive results and robust generalization across editing tasks, with improved non-edit-region preservation and locality. The approach offers a practical, scalable pathway toward flexible, data-efficient, instruction-driven editing that can extend to video and interactive workflows.
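The two training signals named above can be illustrated with a toy objective. This is a minimal sketch, not the paper's loss: it assumes both terms are mean squared errors and that they are combined with scalar weights. The function names, `distill_weight`, and `tail_weight` are hypothetical, and plain Python lists stand in for DiT block features and 3D-VAE latents.

```python
from typing import Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def blockwise_distill_loss(student_blocks: Sequence[Vector],
                           teacher_blocks: Sequence[Vector]) -> float:
    """Average per-block MSE between student and frozen-teacher features.

    Assumption: one feature vector per transformer block, matched by index.
    """
    if len(student_blocks) != len(teacher_blocks):
        raise ValueError("student and teacher must expose the same number of blocks")
    per_block = [mse(s, t) for s, t in zip(student_blocks, teacher_blocks)]
    return sum(per_block) / len(per_block)


def tail_frame_loss(pred_latents: Sequence[Vector],
                    target_tail_latent: Vector) -> float:
    """Supervise only the last (tail) frame latent of the predicted sequence."""
    return mse(pred_latents[-1], target_tail_latent)


def total_loss(student_blocks: Sequence[Vector],
               teacher_blocks: Sequence[Vector],
               pred_latents: Sequence[Vector],
               target_tail_latent: Vector,
               distill_weight: float = 1.0,   # hypothetical weighting
               tail_weight: float = 1.0) -> float:
    """Weighted sum of the distillation term and the tail-frame term."""
    return (distill_weight * blockwise_distill_loss(student_blocks, teacher_blocks)
            + tail_weight * tail_frame_loss(pred_latents, target_tail_latent))
```

In this sketch the distillation term touches every block, while the tail-frame term constrains only the final latent, so intermediate frames are shaped by the teacher alone and the edited image is pinned by the ground-truth target.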

Abstract

We observe that recent advances in multimodal foundation models have propelled instruction-driven image generation and editing into a genuinely cross-modal, cooperative regime. Nevertheless, state-of-the-art editing pipelines remain costly: beyond training large diffusion/flow models, they require curating massive high-quality triplets of \{instruction, source image, edited image\} to cover diverse user intents. Moreover, the fidelity of visual replacements hinges on how precisely the instruction references the target semantics. We revisit this challenge through the lens of temporal modeling: if video can be regarded as a full temporal process, then image editing can be seen as a degenerate temporal process. This perspective allows us to transfer single-frame evolution priors from video pre-training, enabling a highly data-efficient fine-tuning regime. Empirically, our approach matches the performance of leading open-source baselines while using only about one percent of the supervision demanded by mainstream editing models.

Paper Structure

This paper contains 30 sections, 6 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Video4Edit: image editing as a degenerate temporal process. We view image edits through a temporal lens and categorize them into two families: temporal evolution (state changes over time with minimal spatial re-layout) and spatial evolution (structural reconfiguration). After rewriting the instruction into an evolution-style caption, a video-pretrained T2V model can often perform temporal-evolution edits in a zero-shot manner (though tasks such as replace still need additional consistency constraints), while spatial-evolution edits remain challenging. We find that a light fine-tuning of the video-pretrained model suffices to handle both families, enabling general-purpose image editing.
  • Figure 2: Video4Edit overall pipeline. We formulate image editing as a degenerate temporal process and adopt a teacher–student framework. The teacher (Wan2.1 FLF2V-14B [wan2025]) receives the source image as the first frame and the edited image as the last frame, guided by an offline evolution prompt distilled from the instruction, to roll out temporally coherent intermediate states. The student (Wan2.1 I2V-14B-720P) takes only the source image and instruction, learning from teacher signals to produce the edited result in a few steps at inference.
  • Figure 3: Comparative illustration of our method, open-source approaches, and commercial systems.
  • Figure 4: Comparison with native I2V baseline. Even in zero-shot scenarios where the native I2V model can generate plausible edits, it often introduces inconsistencies in non-edit regions (e.g., background artifacts, color shifts, structural distortions). Video4Edit maintains better consistency in non-edit regions through explicit supervision and distillation-based training.
  • Figure 5: Multi-task support. Our method handles diverse editing tasks including subject addition, removal, replacement, background change, color alteration, and style transfer, demonstrating the versatility of our temporal-evolution framework.
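
The asymmetry between teacher and student conditioning described in the Figure 2 caption can be made concrete with a small data-structure sketch. Everything here is illustrative: `EditSample`, `teacher_inputs`, `student_inputs`, and the dictionary keys are hypothetical names, not the Wan2.1 API, and images are left as opaque placeholders.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class EditSample:
    """One training triplet plus its offline evolution prompt (hypothetical layout)."""
    source_image: Any      # placeholder for an image tensor or file path
    edited_image: Any      # placeholder for the ground-truth edited image
    instruction: str       # concise user instruction
    evolution_prompt: str  # evolution-style caption prepared offline from the instruction


def teacher_inputs(sample: EditSample) -> Dict[str, Any]:
    """First-and-last-frame conditioning: the teacher sees both endpoints."""
    return {
        "first_frame": sample.source_image,
        "last_frame": sample.edited_image,
        "prompt": sample.evolution_prompt,
    }


def student_inputs(sample: EditSample) -> Dict[str, Any]:
    """Image-to-video conditioning: the student never sees the edited image."""
    return {
        "first_frame": sample.source_image,
        "prompt": sample.instruction,
    }
```

Under this reading, an edit is a short clip whose first frame is the source and whose tail frame is the result. The teacher is told both ends and fills in the trajectory, while the student must reach the same tail frame from the source image and instruction alone.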