Motion Marionette: Rethinking Rigid Motion Transfer via Prior Guidance
Haoxuan Wang, Jiachen Tao, Junyi Wu, Gaowen Liu, Ramana Rao Kompella, Yan Yan
TL;DR
Motion Marionette tackles rigid motion transfer from a monocular source video to a single-view target image by introducing a spatial-temporal (SpaT) prior that is shared across objects and independent of absolute geometry. It lifts both source and target into a 3D Gaussian Splatting (3DGS) space, extracts motion trajectories, and derives a velocity field $\mathcal{V}(t,\mathcal{G})=\{\boldsymbol{v}_t\}_{t=1}^{T-1}$ that drives motion via unit-step Euler integration, $\boldsymbol{\mu}_{t+1}=\boldsymbol{\mu}_t+\boldsymbol{v}_t$, with corrections from Position-Based Dynamics to maintain coherence. The SpaT prior is constructed in two stages: dense foreground trajectory sampling, followed by Umeyama-based rigid alignment to obtain $\mathbf{R}_t$ and $\boldsymbol{\delta}_t$, which together form a transferable descriptor of relative spatial change over time. The framework enables controllable video generation by manipulating the velocity field and camera poses, producing arbitrary-length sequences from diverse viewpoints while maintaining geometric consistency. Experiments show strong generalization across object types and strong temporal coherence relative to prior-based baselines.
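The two numerical ingredients named above, rigid alignment and Euler integration, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `rigid_align` and `euler_rollout` are hypothetical, the alignment is the scale-free (Kabsch-style) case of the Umeyama solution, and reading $\mathbf{R}_t$ as a rotation and $\boldsymbol{\delta}_t$ as a translation is an assumption. The toy data at the end exists only to check that a known transform is recovered.

```python
import numpy as np

def rigid_align(src, dst):
    """Least-squares rigid fit so that dst ~ src @ R.T + delta.

    Scale-free case of the Umeyama solution (assumption: no scale term).
    src, dst: (N, 3) arrays of corresponding points.
    """
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # Cross-covariance between the centered point sets.
    H = (dst - dst_c).T @ (src - src_c) / len(src)
    U, _, Vt = np.linalg.svd(H)
    # Guard against reflections so that det(R) = +1.
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt
    delta = dst_c - R @ src_c
    return R, delta

def euler_rollout(mu0, velocities):
    """Apply mu_{t+1} = mu_t + v_t for each per-frame velocity (unit time step)."""
    traj = [np.asarray(mu0, dtype=float)]
    for v in velocities:
        traj.append(traj[-1] + v)
    return np.stack(traj)

# Toy check: recover a known rotation about z and a known translation.
rng = np.random.default_rng(0)
pts = rng.normal(size=(50, 3))
theta = np.pi / 6
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
delta_true = np.array([0.5, -0.2, 1.0])
moved = pts @ R_true.T + delta_true
R_est, delta_est = rigid_align(pts, moved)

# Roll two points forward for three frames under a constant velocity.
traj = euler_rollout(np.zeros((2, 3)), [np.ones((2, 3))] * 3)
```

Applied frame by frame to sampled source trajectories, an alignment of this kind would give one $(\mathbf{R}_t, \boldsymbol{\delta}_t)$ pair per time step. How the paper converts those pairs into per-Gaussian velocities for the target is not specified in this summary, so that step is left out of the sketch.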
Abstract
We present Motion Marionette, a zero-shot framework for rigid motion transfer from monocular source videos to single-view target images. Previous works typically employ geometric, generative, or simulation priors to guide the transfer process, but these external priors introduce auxiliary constraints that lead to trade-offs between generalizability and temporal consistency. To address these limitations, we propose guiding the motion transfer process through an internal prior that exclusively captures the spatial-temporal transformations and is shared between the source video and any transferred target video. Specifically, we first lift both the source video and the target image into a unified 3D representation space. Motion trajectories are then extracted from the source video to construct a spatial-temporal (SpaT) prior that is independent of object geometry and semantics, encoding relative spatial variations over time. This prior is further integrated with the target object to synthesize a controllable velocity field, which is subsequently refined using Position-Based Dynamics to mitigate artifacts and enhance visual coherence. The resulting velocity field can be flexibly employed for efficient video production. Empirical results demonstrate that Motion Marionette generalizes across diverse objects, produces temporally consistent videos that align well with the source motion, and supports controllable video generation.
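The refinement step mentioned in the abstract can be pictured with a generic Position-Based Dynamics update. The sketch below is illustrative only. It assumes pairwise distance constraints between points, equal masses, and a unit time step; the abstract does not state which constraints the method actually uses. The function name `pbd_refine_step` and the `edges` and `rest_len` arguments are hypothetical.

```python
import numpy as np

def pbd_refine_step(mu, v, edges, rest_len, iters=10):
    """One generic PBD step with distance constraints (assumed constraint type).

    mu: (N, 3) current positions; v: (N, 3) proposed per-frame velocities.
    edges: list of (i, j) index pairs; rest_len: target distance per edge.
    Returns the corrected positions and the corrected velocities.
    """
    mu = np.asarray(mu, dtype=float)
    pred = mu + np.asarray(v, dtype=float)  # Euler prediction, unit time step
    for _ in range(iters):
        for (i, j), d0 in zip(edges, rest_len):
            diff = pred[i] - pred[j]
            dist = np.linalg.norm(diff)
            if dist < 1e-12:
                continue  # direction undefined for coincident points
            # Split the length error equally between both endpoints.
            corr = 0.5 * (dist - d0) * diff / dist
            pred[i] -= corr
            pred[j] += corr
    # The corrected velocity is the displacement that was actually applied.
    return pred, pred - mu

# Two points one unit apart; the proposed velocity would stretch the pair.
mu = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
v = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0]])
new_mu, new_v = pbd_refine_step(mu, v, edges=[(0, 1)], rest_len=[1.0])
```

In this toy case the stretching velocity is replaced by a shared translation of 0.25 along x for both points, so their separation stays at the rest length. This is the sense in which a PBD-style projection can remove non-rigid artifacts from a velocity field.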
