ConMo: Controllable Motion Disentanglement and Recomposition for Zero-Shot Motion Transfer

Jiayi Gao, Zijin Yin, Changcheng Hua, Yuxin Peng, Kongming Liang, Zhanyu Ma, Jun Guo, Yang Liu

TL;DR

ConMo tackles zero-shot motion transfer in text-to-video by decoupling reference-motion signals into per-subject and background cues using subject masks and recombining them with soft guidance. It introduces a two-stage pipeline: motion disentanglement via Local Spatial Marginal Means (LSMM) and per-subject isolation, followed by motion recomposition in a diffusion-based generator with a soft-weighted blend $\Delta^{(i,j)}_{s^*_k} = \frac{\Delta^{(i,j)}_{s_k}+w_c \Delta^{(i,j)}_{c}}{w_c+1}$ to enable flexible shape changes. The method requires no additional training and enables broad applications such as semantic edits, size/position control, object removal, and camera-motion simulation, achieving superior motion fidelity and semantic consistency over state-of-the-art baselines. Extensive experiments on a multi-video dataset with qualitative, quantitative, and user-study evaluations demonstrate strong gains in multi-subject motion retention and prompt alignment, validating the effectiveness of motion disentanglement and soft-guided recomposition in complex scenes.
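The soft-weighted blend in the TL;DR can be illustrated with a small numerical sketch. It assumes that $\Delta_{s_k}$ is the motion cue of subject $k$, that $\Delta_c$ is the background (camera) motion cue, and that $w_c \ge 0$ is a scalar weight; this reading is inferred from the formula and the Figure 5 caption rather than taken from the released code. The function name `soft_guidance_blend` and the toy arrays are hypothetical.

```python
import numpy as np


def soft_guidance_blend(delta_subject, delta_background, w_c):
    """Blend a subject motion cue with a background cue.

    Computes (delta_subject + w_c * delta_background) / (w_c + 1),
    mirroring the soft-weighted formula in the TL;DR. The names and the
    non-negativity check are assumptions made for this illustration.
    """
    if w_c < 0:
        raise ValueError("w_c is assumed to be non-negative")
    delta_subject = np.asarray(delta_subject, dtype=float)
    delta_background = np.asarray(delta_background, dtype=float)
    return (delta_subject + w_c * delta_background) / (w_c + 1.0)


# Toy motion cues (hypothetical values, not taken from the paper).
delta_s = np.array([1.0, -2.0, 0.5])
delta_c = np.array([1.0, 1.0, 1.0])

# w_c = 0 keeps the subject cue unchanged.
unchanged = soft_guidance_blend(delta_s, delta_c, 0.0)

# Larger w_c pulls the blended cue toward the background cue.
blended = soft_guidance_blend(delta_s, delta_c, 3.0)
print(unchanged, blended)
```

Under these assumptions, `w_c = 0` reproduces the subject cue exactly, and the blended cue approaches the background cue as `w_c` grows. This is consistent with the Figure 5 caption, which notes that original shape details diminish as the background motion weight increases.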

Abstract

The development of Text-to-Video (T2V) generation has made motion transfer possible, enabling the control of video motion based on existing footage. However, current methods have two limitations: 1) they struggle to handle multi-subject videos, failing to transfer the motion of a specific subject; 2) they struggle to preserve the diversity and accuracy of motion when transferring to subjects with varying shapes. To overcome these, we introduce ConMo, a zero-shot framework that disentangles and recomposes the motions of subjects and camera movements. ConMo isolates individual subject and background motion cues from complex trajectories in source videos using only subject masks, and reassembles them for target video generation. This approach enables more accurate motion control across diverse subjects and improves performance in multi-subject scenarios. Additionally, we propose soft guidance in the recomposition stage, which controls how much of the original motion is retained in order to adjust shape constraints, aiding subject shape adaptation and semantic transformation. Unlike previous methods, ConMo unlocks a wide range of applications, including subject size and position editing, subject removal, semantic modifications, and camera motion simulation. Extensive experiments demonstrate that ConMo significantly outperforms state-of-the-art methods in motion fidelity and semantic consistency. The code is available at https://github.com/Andyplus1/ConMo.

Paper Structure

This paper contains 18 sections, 6 equations, 15 figures, 2 tables.

Figures (15)

  • Figure 1: We propose ConMo to achieve various motion transfer applications: (a) multi-subject motion transfer, (b) subject semantic/category change, (c) subject size editing, (d) subject position editing, (e) object removal, and (f) camera motion simulation. (Green text indicates major changes.)
  • Figure 2: Overview of ConMo. The method mainly consists of two stages: (a) Reference Video's Motion Disentanglement Stage: We first acquire masks for each subject in the reference video using SAM2 [sam2], along with video latent features acquired during DDIM inversion [song2020denoising]. Then, based on the masks, we identify the motion regions of each subject across different frames in the reference video. By calculating the difference of local spatial marginal means of latent features in these regions, we disentangle each subject's motion. (b) Motion Recomposition for Target Video Generation Stage: The extracted motion is integrated into the initial noise via the Motion Guidance function and the Soft Guidance strategy. This allows generating target videos with consistent motion and adaptive shape handling. The method supports various video editing effects such as semantic changes, object removal, position editing, and camera simulation.
  • Figure 3: Qualitative evaluation of multi-subject motion transfer. Our method achieves better results in terms of text alignment and multi-subject motion fidelity.
  • Figure 4: Qualitative evaluation of motion transfer with drastic semantic and shape alteration. Our method outperforms other methods when subject shape changes are notable.
  • Figure 5: Controllable Motion Granularity. Comparison of motion transfer across vehicle types with varying shape alterations. As background motion weight increases, original shape details diminish and alignment with prompts improves.
  • ...and 10 more figures