Motion Modes: What Could Happen Next?

Karran Pandey, Matheus Gadelha, Yannick Hold-Geoffroy, Karan Singh, Niloy J. Mitra, Paul Guerrero

TL;DR

Motion Modes tackles the challenge of generating diverse, plausible object motions from a single image without training new models. It harnesses a pre-trained image-to-video diffusion backbone and applies inference-time guidance energies to steer motion generation toward object-centric motion while suppressing camera and scene changes, yielding multiple distinct motions. The approach introduces four energies—static-camera, object-motion, diversity, and smoothness—combined into a single guiding energy that enables sampling of a small set of focused motions, validated by qualitative, quantitative, and user studies that show improvements over baselines and even human predictions. Practically, Motion Modes supports drag-based editing and motion completion, with limitations tied to the underlying video prior and sampling speed, and points to future work on moving-camera and 3D motion extensions.

Abstract

Predicting diverse object motions from a single static image remains challenging, as current video generation models often entangle object movement with camera motion and other scene changes. While recent methods can predict specific motions from motion arrow input, they rely on synthetic data and predefined motions, limiting their application to complex scenes. We introduce Motion Modes, a training-free approach that explores a pre-trained image-to-video generator's latent distribution to discover various distinct and plausible motions focused on selected objects in static images. We achieve this by employing a flow generator guided by energy functions designed to disentangle object and camera motion. Additionally, we use an energy inspired by particle guidance to diversify the generated motions, without requiring explicit training data. Experimental results demonstrate that Motion Modes generates realistic and varied object animations, surpassing previous methods and even human predictions regarding plausibility and diversity. Project Webpage: https://motionmodes.github.io/
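
To make the four guidance terms concrete, here is a minimal sketch of how such energies could be combined into one scalar. It is not the paper's formulation: the track-based motion representation, the specific form of each term, the function names, and the equal weights are all assumptions chosen for illustration.

```python
import math

# ASSUMPTION: a motion is a list of tracks; each track is a list of (dx, dy)
# displacements per frame, relative to the first frame. on_object[i] marks
# whether track i lies on the selected object. None of this is from the paper.

def _mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0

def static_camera_energy(motion, on_object):
    """Penalize displacement of background tracks (a static camera keeps them still)."""
    return _mean(dx * dx + dy * dy
                 for track, obj in zip(motion, on_object) if not obj
                 for dx, dy in track)

def object_motion_energy(motion, on_object):
    """Reward displacement of object tracks: more object motion gives lower energy."""
    return -_mean(math.hypot(dx, dy)
                  for track, obj in zip(motion, on_object) if obj
                  for dx, dy in track)

def smoothness_energy(motion):
    """Penalize abrupt frame-to-frame changes along each track."""
    return _mean((bx - ax) ** 2 + (by - ay) ** 2
                 for track in motion
                 for (ax, ay), (bx, by) in zip(track, track[1:]))

def diversity_energy(motion, previous, bandwidth=1.0):
    """Repulsion from earlier samples: 1 per exact duplicate, near 0 when far apart."""
    total = 0.0
    for other in previous:
        dist2 = _mean((ax - bx) ** 2 + (ay - by) ** 2
                      for ta, tb in zip(motion, other)
                      for (ax, ay), (bx, by) in zip(ta, tb))
        total += math.exp(-dist2 / bandwidth)
    return total

def total_energy(motion, on_object, previous, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four terms; the weights are placeholders."""
    w_cam, w_obj, w_div, w_smooth = weights
    return (w_cam * static_camera_energy(motion, on_object)
            + w_obj * object_motion_energy(motion, on_object)
            + w_div * diversity_energy(motion, previous)
            + w_smooth * smoothness_energy(motion))

# Toy data: one object track and one background track over four frames.
frames = 4
on_object = [True, False]
moving = [(float(t), 0.0) for t in range(frames)]
still = [(0.0, 0.0) for _ in range(frames)]
object_only = [moving, still]            # object moves, background is static
camera_pan = [list(moving), list(moving)]  # everything moves together

print(total_energy(object_only, on_object, []))
print(total_energy(camera_pan, on_object, []))
```

In this toy setup the object-only motion scores lower than the camera pan, and passing a previously generated motion in `previous` raises the energy of a repeat, which is the behavior the guidance is meant to encourage.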

Paper Structure

This paper contains 24 sections, 12 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Could you imagine how the scene evolves in each case? See Figure 2 for plausible yet distinct motion videos predicted by our training-free approach, Motion Modes.
  • Figure 2: Motion Modes creates multiple distinct and plausible motions for a given object, disentangled from the motion of other objects, the camera, and other scene changes. We show three distinct object motions for each of the images in Figure 1, representative of constrained rigid motion (laptop), complex deformations (wave), and articulated characters (lion and cat). We visualize motions as flow trajectories from blue (first frame) to red (last frame). Ghosted intermediate frames further clarify complex motions. See supplemental for the result videos.
  • Figure 3: Method Overview. We generate a motion $\mathbf x$ using a guided denoising approach, where guidance energies encourage smooth object motions that are disentangled from camera motions and distinct from previously generated motions. Iterative sampling gives us a set of diverse motions $\mathcal{X}$.
  • Figure 4: User Study I. We compare the plausible, diverse, and expected nature of our motions to four baselines. Each pair of bars shows the percentage of comparisons in which our method or a baseline was judged favorably, with 95% confidence intervals.
  • Figure 5: Qualitative comparison. Each column shows the first three motions for the masked object in the input (left). Object trajectories have red endpoints, background trajectories (usually due to camera motion) are purple. Motion is additionally visualized by overlaying ghosted intermediate frames. We can see that Motion Modes finds more plausible and diverse object motions disentangled from any other motions or scene changes, such as camera motions.
  • ...and 4 more figures
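
The iterative sampling described in Figure 3 can be illustrated with a toy 1-D example. This is a sketch under strong assumptions, not the authors' implementation: plain gradient descent on a double-well energy stands in for guided denoising with the video prior, and a Gaussian repulsion term stands in for the particle-guidance-inspired diversity energy. All function names and constants are hypothetical.

```python
import math

def prior_grad(x):
    """Gradient of the toy prior energy (x^2 - 1)^2, whose minima (modes) are x = -1 and x = +1."""
    return 4.0 * x * (x * x - 1.0)

def diversity_grad(x, previous, bandwidth=1.0):
    """Gradient of sum_p exp(-(x - p)^2 / (2 * bandwidth^2)), a repulsion from earlier samples."""
    grad = 0.0
    for p in previous:
        d = x - p
        grad += -d / bandwidth ** 2 * math.exp(-d * d / (2.0 * bandwidth ** 2))
    return grad

def sample_modes(n_samples=2, weight=2.0, x0=0.1, lr=0.05, steps=200):
    """Generate samples one at a time; each new sample is pushed away from earlier ones."""
    previous = []
    for _ in range(n_samples):
        x = x0  # every sample starts from the same point to isolate the effect of guidance
        for _ in range(steps):
            # Descend the combined energy: toy prior plus weighted diversity term.
            x -= lr * (prior_grad(x) + weight * diversity_grad(x, previous))
        previous.append(x)
    return previous

guided = sample_modes()              # diversity guidance on
unguided = sample_modes(weight=0.0)  # diversity guidance off
print(guided, unguided)
```

With the repulsion enabled, the second sample settles near the other mode instead of repeating the first; with `weight=0.0`, both samples collapse to the same mode.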