Motion Marionette: Rethinking Rigid Motion Transfer via Prior Guidance

Haoxuan Wang, Jiachen Tao, Junyi Wu, Gaowen Liu, Ramana Rao Kompella, Yan Yan

TL;DR

Motion Marionette tackles rigid motion transfer from a monocular source video to a single-view target image by introducing a spatial-temporal (SpaT) prior that is shared across objects and independent of absolute geometry. It lifts both source and target into a 3D Gaussian Splatting (3DGS) space, extracts motion trajectories, and derives a velocity field $\mathcal{V}(t,\mathcal{G})=\{\boldsymbol{v}_t\}_{t=1}^{T-1}$ that drives motion via Euler integration, $\boldsymbol{\mu}_{t+1}=\boldsymbol{\mu}_t+\boldsymbol{v}_t$, with corrections from Position-Based Dynamics to maintain coherence. The SpaT prior is constructed in two stages: dense foreground trajectory sampling, followed by Umeyama-based rigid alignment to obtain $\mathbf{R}_t$ and $\boldsymbol{\delta}_t$, which together form a transferable descriptor of relative spatial changes over time. The framework enables controllable video generation by manipulating the velocity field and camera poses, producing arbitrary-length sequences from diverse viewpoints while maintaining geometric consistency. Experiments show strong generalization across object types and strong temporal coherence relative to baselines that rely on external priors.
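The alignment and integration steps summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`rigid_align`, `spat_prior`, `transfer`) are hypothetical, the alignment is the scale-free form of Umeyama's method, and the way the prior is applied to the target (rotation about the target centroid plus the source centroid displacement as $\boldsymbol{\delta}_t$) is an assumption made only for illustration.

```python
import numpy as np

def rigid_align(src, dst):
    """Least-squares R, t such that dst ~= src @ R.T + t (Umeyama without scale)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (dst - dst_c).T @ (src - src_c)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(U @ Vt))           # guard against reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    t = dst_c - R @ src_c
    return R, t

def spat_prior(traj):
    """traj: (T, N, 3) source trajectories -> list of (R_t, delta_t), length T-1.

    Assumption: delta_t is taken as the centroid displacement between frames.
    """
    prior = []
    for t in range(traj.shape[0] - 1):
        R, _ = rigid_align(traj[t], traj[t + 1])
        delta = traj[t + 1].mean(axis=0) - traj[t].mean(axis=0)
        prior.append((R, delta))
    return prior

def transfer(mu0, prior):
    """Drive target points mu0 (M, 3) with the prior via explicit Euler steps."""
    mu = mu0.copy()
    frames = [mu.copy()]
    for R, delta in prior:
        c = mu.mean(axis=0)
        # Assumed velocity: rotate about the target centroid, then translate.
        v = (mu - c) @ (R - np.eye(3)).T + delta
        mu = mu + v                               # mu_{t+1} = mu_t + v_t
        frames.append(mu.copy())
    return np.stack(frames)
```

Because the prior stores only relative rotations and displacements, the same list of `(R_t, delta_t)` pairs can be applied to a target point set of any size or shape, which is the sense in which such a descriptor is independent of absolute geometry.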

Abstract

We present Motion Marionette, a zero-shot framework for rigid motion transfer from monocular source videos to single-view target images. Previous works typically employ geometric, generative, or simulation priors to guide the transfer process, but these external priors introduce auxiliary constraints that lead to trade-offs between generalizability and temporal consistency. To address these limitations, we propose guiding the motion transfer process through an internal prior that exclusively captures the spatial-temporal transformations and is shared between the source video and any transferred target video. Specifically, we first lift both the source video and the target image into a unified 3D representation space. Motion trajectories are then extracted from the source video to construct a spatial-temporal (SpaT) prior that is independent of object geometry and semantics, encoding relative spatial variations over time. This prior is further integrated with the target object to synthesize a controllable velocity field, which is subsequently refined using Position-Based Dynamics to mitigate artifacts and enhance visual coherence. The resulting velocity field can be flexibly employed for efficient video production. Empirical results demonstrate that Motion Marionette generalizes across diverse objects, produces temporally consistent videos that align well with the source motion, and supports controllable video generation.
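To make the refinement step more concrete, here is a minimal sketch of a generic Position-Based Dynamics projection. The abstract does not specify which constraints the method uses, so the distance constraints, equal point masses, the `edges` array, and the function name `pbd_refine` are all assumptions for illustration. The idea shown is only the general one: predicted positions are iteratively projected so that selected point pairs keep their rest distances.

```python
import numpy as np

def pbd_refine(pred, rest, edges, iters=10):
    """Project predicted positions onto distance constraints (Gauss-Seidel PBD).

    pred:  (N, 3) positions after an unconstrained Euler step
    rest:  (N, 3) reference positions defining the rest lengths
    edges: (E, 2) integer index pairs to constrain (hypothetical choice)
    """
    x = pred.copy()
    rest_len = np.linalg.norm(rest[edges[:, 0]] - rest[edges[:, 1]], axis=1)
    for _ in range(iters):
        for (a, b), length in zip(edges, rest_len):
            d = x[a] - x[b]
            dist = np.linalg.norm(d)
            if dist < 1e-12:
                continue                      # skip degenerate pairs
            # Equal masses: each endpoint takes half of the correction.
            corr = 0.5 * (dist - length) * d / dist
            x[a] -= corr
            x[b] += corr
    return x
```

Under this assumption, a corrected velocity for step $t$ could be recovered as the difference between the refined positions and the previous positions, so that the stored velocity field already reflects the constraint projection.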

Paper Structure

This paper contains 25 sections, 9 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Overview of Motion Marionette. (a) We lift both the source video and the target image into 3DGS representations. Motion trajectories are then extracted from the source video and used to construct the SpaT prior, which is integrated with the target object using Euler integration to produce a velocity field that guides motion transfer. (b) We patchify the velocity field and perform iterative optimization to mitigate error accumulation caused by the absence of supervision and the use of Euler integration. (c) The explicit velocity field can thus be flexibly utilized for efficient rendering of coherent videos and also enables controllable video generation.
  • Figure 2: Qualitative motion transfer results. Time progresses from left to right. Arrows in the leftmost column indicate the approximate motion direction in the source video.
  • Figure 3: Video visual quality evaluation.
  • Figure 3: Examples of controllable video generation. (a) shows control over camera poses for generating different views; (b) shows results of varying motion speed through velocity scaling.
  • Figure 4: Effect of the adopted losses.
  • ...and 1 more figure