Are Image-to-Video Models Good Zero-Shot Image Editors?

Zechuan Zhang, Zhenyuan Chen, Zongxin Yang, Yi Yang

TL;DR

This paper tackles zero-shot image editing by repurposing pretrained image-to-video diffusion models. It introduces IF-Edit, a tuning-free framework with three modules (Chain-of-Thought prompt enhancement, Temporal Latent Dropout, and Self-Consistent Post-Refinement) that align prompts with the video model, cut redundant temporal computation, and sharpen the final frame, all without finetuning. Evaluated on four benchmarks, IF-Edit performs strongly on non-rigid and reasoning-centric edits and remains competitive on general instruction-based edits, which suggests that video priors are useful for unified image editing. The work offers a practical recipe for using video diffusion models as image editors, along with insights into their strengths and limitations for temporally grounded generative reasoning.

Abstract

Large-scale video diffusion models show strong world simulation and temporal reasoning abilities, but their use as zero-shot image editors remains underexplored. We introduce IF-Edit, a tuning-free framework that repurposes pretrained image-to-video diffusion models for instruction-driven image editing. IF-Edit addresses three key challenges: prompt misalignment, redundant temporal latents, and blurry late-stage frames. It includes (1) a chain-of-thought prompt enhancement module that transforms static editing instructions into temporally grounded reasoning prompts; (2) a temporal latent dropout strategy that compresses frame latents after the expert-switch point, accelerating denoising while preserving semantic and temporal coherence; and (3) a self-consistent post-refinement step that sharpens late-stage frames using a short still-video trajectory. Experiments on four public benchmarks, covering non-rigid editing, physical and temporal reasoning, and general instruction edits, show that IF-Edit performs strongly on reasoning-centric tasks while remaining competitive on general-purpose edits. Our study provides a systematic view of video diffusion models as image editors and highlights a simple recipe for unified video-image generative reasoning.
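
The temporal latent dropout idea in the abstract (compressing frame latents once denoising passes the expert-switch point) can be illustrated with a minimal sketch. This is not the authors' implementation: the uniform `stride`, the choice to always keep the first and last latents, and the `denoise_step` callback are assumptions made only for illustration, and latents are assumed to be an array with time as the first axis.

```python
import numpy as np

def temporal_latent_dropout(latents, stride=2):
    """Keep every `stride`-th temporal latent (assumed rule, not from the paper).

    The first and last latents are always retained so the source frame and
    the final edited frame both survive the compression.
    """
    num_frames = latents.shape[0]
    assert num_frames >= 1, "need at least one temporal latent"
    keep = np.arange(0, num_frames, stride)
    if keep[-1] != num_frames - 1:
        keep = np.append(keep, num_frames - 1)
    return latents[keep], keep

def denoise_with_dropout(latents, num_steps, switch_step, denoise_step, stride=2):
    """Run a denoising loop and sparsify the latents once at `switch_step`.

    `denoise_step(latents, step)` is a hypothetical stand-in for one call to
    the video diffusion model; steps after the switch see fewer latents.
    """
    for step in range(num_steps):
        if step == switch_step:
            latents, _ = temporal_latent_dropout(latents, stride)
        latents = denoise_step(latents, step)
    return latents
```

Because every step after the switch processes fewer temporal latents, the later part of the loop is cheaper, which is the source of the speedup the abstract describes.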

Paper Structure

This paper contains 12 sections, 2 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Visual results of IF-Edit. We propose IF-Edit (Image Editing by Generating Frames), a tuning-free framework that repurposes image-to-video diffusion models for zero-shot image editing. By leveraging the world-simulation priors of video models and our proposed modules, IF-Edit achieves physically consistent and semantically aligned edits, excelling in non-rigid transformations, temporal progression, and causal reasoning scenarios.
  • Figure 2: Comparison with previous methods. Unlike prior approaches that suffer from redundant video generation and costly VLM-based frame selection, we design efficient strategies in both stages: an efficient temporal dropout to reduce redundant computation during generation, and a fast self-consistent refinement for sharp and high-quality final results.
  • Figure 3: Visualization of expert behaviors (see the Temporal Latent Dropout section). Given the prompt “She loosens her grip, the card drifts down,” the high-noise expert quickly builds a coherent global layout but lacks fine detail, whereas the low-noise expert enhances local fidelity while losing spatial consistency. This observation motivates preserving early layout formation and refining details efficiently.
  • Figure 4: Overview of IF-Edit (see the Method section). Our framework adapts an image-to-video diffusion model for zero-shot image editing through three components: (1) Prompt Enhancement via CoT, which reformulates static instructions into temporally grounded reasoning prompts; (2) Temporal Latent Dropout (TLD), which accelerates inference by sparsifying temporal latents while preserving motion consistency; and (3) Self-Consistent Post-Refinement, which selects the sharpest frame via Laplacian score and performs still-video refinement to enhance detail and stability. Together, these modules enable efficient, physically consistent, and instruction-aligned image editing.
  • Figure 5: Qualitative results on TEdBench (see the non-rigid evaluation section). Comparison with other methods across non-rigid editing tasks.
  • ...and 4 more figures
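
The Figure 4 caption mentions selecting the sharpest frame via a Laplacian score. A minimal sketch of one common way to do this follows. It assumes the score is the variance of a 4-neighbor discrete Laplacian over a grayscale frame, which is a standard sharpness measure but is not confirmed by the source as the exact formula. The function names are hypothetical.

```python
import numpy as np

def laplacian_score(gray):
    """Variance of the 4-neighbor Laplacian over interior pixels.

    `gray` is a 2D array (H, W). Higher values indicate more high-frequency
    detail, i.e. a sharper frame. Assumed scoring rule for illustration.
    """
    g = np.asarray(gray, dtype=np.float64)
    lap = (
        g[:-2, 1:-1] + g[2:, 1:-1]      # up and down neighbors
        + g[1:-1, :-2] + g[1:-1, 2:]    # left and right neighbors
        - 4.0 * g[1:-1, 1:-1]           # center pixel
    )
    return float(lap.var())

def select_sharpest_frame(frames):
    """Return the index of the frame with the highest Laplacian score, plus all scores."""
    scores = [laplacian_score(f) for f in frames]
    return int(np.argmax(scores)), scores
```

In a pipeline like the one the caption describes, the selected frame would then be passed on to the still-video refinement step. That step depends on the video model itself and is not sketched here.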