Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators
Daniel Geng, Andrew Owens
TL;DR
This work addresses the difficulty of making precise structural edits with diffusion models by introducing motion guidance, a zero-shot technique that uses a user-specified dense optical flow field to steer diffusion sampling. By backpropagating through an off-the-shelf flow estimator, the method optimizes a joint loss that enforces the desired motion while preserving visual fidelity to the source, enabling complex deformations, disocclusions, and motion transfer without training or architecture changes. Key contributions include a differentiable flow-guided diffusion framework, occlusion- and mask-handling strategies, recursive denoising for stability, and qualitative and quantitative evaluation against baselines. The approach extends diffusion-based image editing to dense, pixel-level motion on both real and synthetic images, and points toward integrating other differentiable vision priors into generative models.
Abstract
Diffusion models are capable of generating impressive images conditioned on text descriptions, and extensions of these models allow users to edit images at a relatively coarse scale. However, precisely editing the layout, position, pose, and shape of objects in images with diffusion models remains difficult. To this end, we propose motion guidance, a zero-shot technique that allows a user to specify dense, complex motion fields that indicate where each pixel in an image should move. Motion guidance works by steering the diffusion sampling process with gradients backpropagated through an off-the-shelf optical flow network. Specifically, we design a guidance loss that encourages the sample to have the desired motion, as estimated by a flow network, while also being visually similar to the source image. By simultaneously sampling from a diffusion model and guiding the sample to have low guidance loss, we can obtain a motion-edited image. We demonstrate that our technique works on complex motions and produces high-quality edits of real and generated images.
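To make the two-term guidance loss concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the L1 distances, the weights `w_flow` and `w_color`, the nearest-neighbor `backward_warp`, and the toy estimators `perfect_estimator` and `zero_estimator` are all hypothetical stand-ins. In practice the flow estimator would be a differentiable network and the loss would be differentiated with autodiff during sampling; this sketch only evaluates the loss.

```python
def flow_loss(estimated, target):
    """Mean L1 distance between two flow fields of (dx, dy) vectors."""
    total, count = 0.0, 0
    for est_row, tgt_row in zip(estimated, target):
        for (ex, ey), (tx, ty) in zip(est_row, tgt_row):
            total += abs(ex - tx) + abs(ey - ty)
            count += 1
    return total / count


def backward_warp(image, flow):
    """Nearest-neighbor warp: out[y][x] = image[y + dy][x + dx], clamped at borders."""
    height, width = len(image), len(image[0])
    warped = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = flow[y][x]
            sx = min(max(x + int(round(dx)), 0), width - 1)
            sy = min(max(y + int(round(dy)), 0), height - 1)
            row.append(image[sy][sx])
        warped.append(row)
    return warped


def color_loss(source, generated, flow):
    """Mean absolute difference between the source and the generated image
    warped back onto the source grid along the given flow."""
    warped = backward_warp(generated, flow)
    total, count = 0.0, 0
    for src_row, warp_row in zip(source, warped):
        for s, w in zip(src_row, warp_row):
            total += abs(s - w)
            count += 1
    return total / count


def guidance_loss(source, generated, target_flow, flow_estimator,
                  w_flow=1.0, w_color=1.0):
    """Weighted sum of a motion term and an appearance term (weights are assumed)."""
    estimated = flow_estimator(source, generated)
    return (w_flow * flow_loss(estimated, target_flow)
            + w_color * color_loss(source, generated, target_flow))


# Toy 1x4 grayscale image with one bright pixel; the target flow moves
# every pixel one step to the right.
source = [[0.0, 1.0, 0.0, 0.0]]
target_flow = [[(1, 0)] * 4]
edited = [[0.0, 0.0, 1.0, 0.0]]  # bright pixel shifted right by one


def perfect_estimator(src, gen):
    # Hypothetical estimator that reports exactly the target motion.
    return target_flow


def zero_estimator(src, gen):
    # Hypothetical estimator that reports no motion at all.
    return [[(0, 0)] * 4]


loss_edited = guidance_loss(source, edited, target_flow, perfect_estimator)
loss_unedited = guidance_loss(source, source, target_flow, zero_estimator)
```

In this toy setup the correctly shifted image incurs zero loss, while the unedited source is penalized by both terms, which is the signal that guidance would follow during sampling.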
