Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators

Daniel Geng, Andrew Owens

TL;DR

This work tackles the difficulty of performing precise structural edits with diffusion models by introducing motion guidance, a zero-shot technique that uses a user-specified dense optical flow field to steer diffusion sampling. By backpropagating through an off-the-shelf flow estimator, the method optimizes a joint loss that enforces the desired motion while preserving visual fidelity to the source, enabling complex deformations, disocclusions, and motion transfer without training or architecture changes. Key contributions include a differentiable flow-guided diffusion framework, occlusion- and mask-handling strategies, recursive denoising for stability, and comprehensive qualitative and quantitative evaluation against baselines. The approach broadens diffusion-based image editing to dense, pixel-level motion, with practical implications for editing real and synthetic images, and for future integration of differentiable vision priors into generative models.

Abstract

Diffusion models are capable of generating impressive images conditioned on text descriptions, and extensions of these models allow users to edit images at a relatively coarse scale. However, precisely editing the layout, position, pose, and shape of objects in images with diffusion models remains difficult. To this end, we propose motion guidance, a zero-shot technique that allows a user to specify dense, complex motion fields that indicate where each pixel in an image should move. Motion guidance works by steering the diffusion sampling process with gradients backpropagated through an off-the-shelf optical flow network. Specifically, we design a guidance loss that encourages the sample to have the desired motion, as estimated by a flow network, while also being visually similar to the source image. By simultaneously sampling from a diffusion model and guiding the sample to have low guidance loss, we can obtain a motion-edited image. We demonstrate that our technique works on complex motions and produces high-quality edits of real and generated images.
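
The abstract describes the guidance loss only in words. One plausible way to write it down is sketched below. Every symbol here is notation assumed for illustration: $x_0$ for the source image, $x$ for the sample being generated, $f^{*}$ for the user-specified target flow, $F$ for the flow network, the weights $\lambda_{\text{flow}}$ and $\lambda_{\text{color}}$, and the noise scale $\sigma_t$. The choice of L1 norms and the exact form of the guided update are likewise assumptions, not details stated in this summary.

```latex
\begin{align}
  % Motion term: the flow estimated from the source x_0 to the sample x
  % should match the user-specified target flow f^*.
  \mathcal{L}_{\text{flow}}(x)  &= \lVert F(x_0, x) - f^{*} \rVert_1 \\
  % Appearance term: warping the sample back onto the source with the
  % estimated flow should reproduce the source image.
  \mathcal{L}_{\text{color}}(x) &= \lVert x_0 - \operatorname{warp}\bigl(x, F(x_0, x)\bigr) \rVert_1 \\
  % Joint guidance loss (weights are hypothetical hyperparameters).
  \mathcal{L}(x) &= \lambda_{\text{flow}}\,\mathcal{L}_{\text{flow}}(x)
                  + \lambda_{\text{color}}\,\mathcal{L}_{\text{color}}(x) \\
  % Guided noise estimate: adding the loss gradient to the predicted noise
  % pushes each sampling step toward lower guidance loss.
  \tilde{\epsilon}_\theta(x_t, t) &= \epsilon_\theta(x_t, t)
                  + \sigma_t \, \nabla_{x_t} \mathcal{L}(x_t)
\end{align}
```

Because $F$ is differentiable, the gradient in the last line can be obtained by backpropagating through the flow network, which is why a formulation of this kind needs no additional training or architecture changes.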

Paper Structure

This paper contains 53 sections, 5 equations, and 19 figures.

Figures (19)

  • Figure 1: Flow Guidance. Given a source image and a target flow, we generate a new image that has the desired flow with respect to the original image. Our method is zero-shot, achieving this by performing guidance through an optical flow network, and works on both real and synthetic images. Note, qualitative results in the main body of this paper were automatically selected for, and random results can be found in Appendix \ref{sec:additional_results}.
  • Figure 2: Moving and Deforming Objects. We show various motion edits on a single source image (a), demonstrating our method can handle diverse deformations including scaling and stretching. We provide a legend for flow visualization in Figure \ref{fig:teaser}.
  • Figure 3: Ablations. We qualitatively ablate out techniques we use to achieve motion guidance. For a discussion please see Section \ref{sec:ablations}. We provide a legend for the flow visualization in Figure \ref{fig:teaser}.
  • Figure 4: Baselines. We show qualitative examples from various baselines and our method. The instruction used for InstructPix2Pix is shown beneath each InstructPix2Pix sample. For a discussion please see Section \ref{sec:baselines}. We provide a legend for the flow visualization in Figure \ref{fig:teaser}.
  • Figure 5: Comparison to DragGAN. DragGAN works only on domains for which a StyleGAN has been trained. Attempts to edit real images that are out-of-domain, even if they are invertible, result in failures that our model handles well. Here we show results on StyleGANs trained for (a) elephants, (b) lions, and (c) faces. We provide a legend for the flow visualization in Figure \ref{fig:teaser}.
  • ...and 14 more figures