Pathways on the Image Manifold: Image Editing via Video Generation

Noam Rotstein, Gal Yona, Daniel Silver, Roy Velich, David Bensaïd, Ron Kimmel

TL;DR

This work reframes image editing as a temporally coherent video generation task, introducing Frame2Frame (F2F), which guides a pretrained image-to-video model along a gradual path on the image manifold to realize edits while preserving source content. The method converts edit instructions into Temporal Editing Captions via a Vision-Language Model, generates a temporally coherent edit sequence with CogVideoX, and automatically selects the frame that best matches the target instruction. Empirical results on the TEdBench and PosEdit benchmarks show state-of-the-art performance in edit accuracy and content preservation, complemented by a user study favoring F2F over competing methods. Beyond text-based editing, F2F demonstrates promising results on classic vision tasks such as denoising, deblurring, outpainting, and relighting, underscoring the broad potential of video-based transformations for image manipulation.

Abstract

Recent advances in image editing, driven by image diffusion models, have shown remarkable progress. However, significant challenges remain, as these models often struggle to follow complex edit instructions accurately and frequently compromise fidelity by altering key elements of the original image. Simultaneously, video generation has made remarkable strides, with models that effectively function as consistent and continuous world simulators. In this paper, we propose merging these two fields by utilizing image-to-video models for image editing. We reformulate image editing as a temporal process, using pretrained video models to create smooth transitions from the original image to the desired edit. This approach traverses the image manifold continuously, ensuring consistent edits while preserving the original image's key aspects. Our approach achieves state-of-the-art results on text-based image editing, demonstrating significant improvements in both edit accuracy and image preservation. Visit our project page at https://rotsteinnoam.github.io/Frame2Frame.

Paper Structure

This paper contains 35 sections, 1 equation, 17 figures, 5 tables.

Figures (17)

  • Figure 2: Editing Manifold Pathway. Given an input image and target caption "A happy person making a heart shape with their hands", our method generates a continuous path on the natural image manifold. Each generated frame (indicated by black arrows) represents a plausible intermediate state between the source and target, maintaining temporal consistency throughout the transformation. As a result, in contrast to the competing approach, F2F achieves the desired edit while preserving the "AI" text on the person’s shirt.
  • Figure 3: Frame2Frame Overview. Given a source image and editing prompt, our pipeline proceeds in three steps. First, a Vision-Language Model generates a temporal caption describing the transformation. Next, this caption guides a video generator to create a natural progression of the edit. Finally, our frame selection strategy identifies the optimal frame that best realizes the desired edit, producing the final image of the cat mid-leap.
  • Figure 4: Qualitative Results on TEdBench. Comparison with other methods across various editing tasks. Our approach consistently produces edits that better align with the target prompt while preserving the source image's content and structure. For instance, in the teddy bear example, our method uniquely achieves complex structural modifications while maintaining high visual quality.
  • Figure 5: Qualitative Results on PosEdit. Comparison between our Frame2Frame method and LEDITS++ on human motion editing tasks. For each example, we show the source image, edited results from both methods, and the ground-truth target image. Our method better preserves subject identity while achieving more natural pose transitions. The evaluation metrics for each image are provided in the appendix.
  • Figure 6: Additional Vision Tasks. Qualitative results of our image-to-video-to-image editing approach on selected traditional tasks.
  • ...and 12 more figures
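
The three-step pipeline described in the Figure 3 caption (temporal caption, video generation, frame selection) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `caption_fn`, `video_fn`, `score_fn`, and `threshold` are hypothetical placeholders standing in for the Vision-Language Model, the image-to-video generator, and an edit-match scorer. The rule of taking the earliest frame whose score clears a threshold is likewise an assumption chosen to favor frames close to the source; the summary above only states that the frame best realizing the edit is selected.

```python
from typing import Any, Callable, Sequence


def select_frame(scores: Sequence[float], threshold: float = 0.8) -> int:
    """Pick a frame index from per-frame edit-match scores.

    Assumed rule (hypothetical): return the earliest frame whose score
    reaches the threshold, since earlier frames have drifted less from the
    source image. If no frame reaches it, fall back to the best-scoring frame.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    for index, score in enumerate(scores):
        if score >= threshold:
            return index
    # No frame cleared the threshold: take the highest score (first on ties).
    return max(range(len(scores)), key=lambda i: scores[i])


def frame2frame_sketch(
    source: Any,
    edit_prompt: str,
    caption_fn: Callable[[Any, str], str],          # stand-in for the VLM
    video_fn: Callable[[Any, str], Sequence[Any]],  # stand-in for the video model
    score_fn: Callable[[Any, str], float],          # stand-in for an edit-match scorer
    threshold: float = 0.8,
) -> Any:
    """Run the three steps in order and return the selected frame."""
    # Step 1: describe the edit as a transformation unfolding over time.
    temporal_caption = caption_fn(source, edit_prompt)
    # Step 2: generate a frame sequence starting from the source image.
    frames = video_fn(source, temporal_caption)
    # Step 3: score every frame against the edit prompt and select one.
    scores = [score_fn(frame, edit_prompt) for frame in frames]
    return frames[select_frame(scores, threshold)]
```

With toy stand-ins, for example a `video_fn` that returns the integers 0 to 4 as "frames" and a `score_fn` that returns `frame / 4`, a threshold of 0.5 selects frame 2, the first one to reach the threshold. In a real setting the three callables would wrap actual models, and the threshold would need tuning against the scorer's scale.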