Motion Modes: What Could Happen Next?
Karran Pandey, Matheus Gadelha, Yannick Hold-Geoffroy, Karan Singh, Niloy J. Mitra, Paul Guerrero
TL;DR
Motion Modes generates diverse, plausible object motions from a single image without training a new model. It takes a pre-trained image-to-video diffusion backbone and applies guidance energies at inference time, steering generation toward motion of the selected object while suppressing camera and scene changes. Four energies (static-camera, object-motion, diversity, and smoothness) are combined into a single guiding energy, so that sampling yields a small set of distinct, object-focused motions. Qualitative results, quantitative metrics, and user studies show improvements over baselines and even over human predictions. In practice, Motion Modes supports drag-based editing and motion completion. Its limitations are tied to the underlying video prior and to sampling speed, and the authors point to moving-camera and 3D motion extensions as future work.
Abstract
Predicting diverse object motions from a single static image remains challenging, as current video generation models often entangle object movement with camera motion and other scene changes. While recent methods can predict specific motions from motion arrow input, they rely on synthetic data and predefined motions, limiting their application to complex scenes. We introduce Motion Modes, a training-free approach that explores a pre-trained image-to-video generator's latent distribution to discover various distinct and plausible motions focused on selected objects in static images. We achieve this by employing a flow generator guided by energy functions designed to disentangle object and camera motion. Additionally, we use an energy inspired by particle guidance to diversify the generated motions, without requiring explicit training data. Experimental results demonstrate that Motion Modes generates realistic and varied object animations, surpassing previous methods and even human predictions regarding plausibility and diversity. Project Webpage: https://motionmodes.github.io/
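To make the idea of a combined guiding energy concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a batch of K candidate flow fields of shape (K, T, H, W, 2) and a binary mask for the selected object. The function names, the exact form of each energy, the RBF kernel used for the diversity term, and the weights are all hypothetical choices made for illustration. In a guided sampler, the gradient of such an energy would be used to nudge each denoising step; that part is omitted here.

```python
import numpy as np

def static_camera_energy(flows, mask):
    """Penalize motion outside the object mask (assumed form)."""
    background = flows[:, :, ~mask, :]            # (K, T, N_bg, 2)
    return float(np.mean(np.sum(background ** 2, axis=-1)))

def object_motion_energy(flows, mask):
    """Reward motion inside the object mask; lower is better (assumed form)."""
    foreground = flows[:, :, mask, :]             # (K, T, N_fg, 2)
    return -float(np.mean(np.sum(foreground ** 2, axis=-1)))

def diversity_energy(flows, bandwidth=1.0):
    """Particle-guidance-style repulsion: mean pairwise RBF similarity
    between the K samples. Identical samples score 1, distant ones near 0."""
    k = flows.shape[0]
    flat = flows.reshape(k, -1)
    sq_dists = np.sum((flat[:, None, :] - flat[None, :, :]) ** 2, axis=-1)
    kernel = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    off_diagonal = kernel[~np.eye(k, dtype=bool)]
    return float(np.mean(off_diagonal))

def smoothness_energy(flows):
    """Penalize abrupt changes of the flow between consecutive frames."""
    temporal_diff = np.diff(flows, axis=1)        # (K, T-1, H, W, 2)
    return float(np.mean(np.sum(temporal_diff ** 2, axis=-1)))

def total_energy(flows, mask, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four terms; the weights are placeholders."""
    w_cam, w_obj, w_div, w_smooth = weights
    return (w_cam * static_camera_energy(flows, mask)
            + w_obj * object_motion_energy(flows, mask)
            + w_div * diversity_energy(flows)
            + w_smooth * smoothness_energy(flows))

# Toy example: two samples, three frames, a 4x4 image with a 2x2 object.
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

flows_same = np.zeros((2, 3, 4, 4, 2))
flows_same[:, :, 1:3, 1:3, 0] = 1.0               # both samples move right

flows_diverse = flows_same.copy()
flows_diverse[1, :, 1:3, 1:3, 0] = -1.0           # second sample moves left

energy_same = total_energy(flows_same, mask)
energy_diverse = total_energy(flows_diverse, mask)
```

In this toy setup both sample sets keep the background static and move the object by the same amount, so only the diversity term separates them: the set whose two samples move in opposite directions receives the lower total energy.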
