Edit-Your-Motion: Space-Time Diffusion Decoupling Learning for Video Motion Editing

Yi Zuo, Lingling Li, Licheng Jiao, Fang Liu, Xu Liu, Wenping Ma, Shuyuan Yang, Yuwei Guo

TL;DR

This paper uses DDIM inversion to initialize the noise and thereby preserve the appearance of the source video, adds a lightweight motion attention adapter module to enhance motion fidelity, and trains with a spatio-temporal two-stage learning strategy (STL).

Abstract

Existing diffusion-based methods have achieved impressive results in human motion editing. However, these methods often exhibit significant ghosting and body distortion in unseen in-the-wild cases. In this paper, we introduce Edit-Your-Motion, a video motion editing method that tackles these challenges through one-shot fine-tuning on unseen cases. First, we use DDIM inversion to initialize the noise, preserving the appearance of the source video, and design a lightweight motion attention adapter module to enhance motion fidelity. DDIM inversion obtains an implicit representation by estimating the predicted noise from the source video; this representation serves as the starting point for the sampling process, ensuring appearance consistency between the source and edited videos. The Motion Attention Module (MA) enhances the model's motion editing ability by resolving the conflict between the skeleton features and the appearance features. Second, to effectively decouple the motion and appearance of the source video, we design a spatio-temporal two-stage learning strategy (STL). In the first stage, we focus on learning the temporal features of human motion and propose recurrent causal attention (RCA) to ensure consistency between video frames. In the second stage, we shift focus to learning the appearance features of the source video. With Edit-Your-Motion, users can edit the motion of humans in the source video, creating more engaging and diverse content. Extensive qualitative and quantitative experiments, along with user preference studies, show that Edit-Your-Motion outperforms other methods.
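
The role of DDIM inversion described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the latent is a short list of floats rather than video latents, `toy_eps` is a hypothetical stand-in for the U-Net noise predictor, `make_alphas` is an assumed linear schedule, and the choice to query the predictor at the noisier of the two timesteps is one common convention, not something the abstract specifies.

```python
import math

def make_alphas(num_steps, a_min=0.02):
    """Hypothetical schedule: cumulative alpha goes from 1.0 (clean) to a_min (noisy)."""
    return [1.0 - (1.0 - a_min) * t / num_steps for t in range(num_steps + 1)]

def _move(x, eps, a_from, a_to):
    """Deterministic DDIM update between two noise levels, reusing one noise estimate."""
    out = []
    for xi, ei in zip(x, eps):
        # Predicted clean sample implied by the current sample and the noise estimate.
        x0_pred = (xi - math.sqrt(1.0 - a_from) * ei) / math.sqrt(a_from)
        out.append(math.sqrt(a_to) * x0_pred + math.sqrt(1.0 - a_to) * ei)
    return out

def ddim_invert(x0, eps_model, alphas):
    """Walk from the clean latent towards noise (inversion)."""
    x = list(x0)
    for t in range(len(alphas) - 1):
        eps = eps_model(x, t + 1)  # assumption: condition on the noisier timestep
        x = _move(x, eps, alphas[t], alphas[t + 1])
    return x

def ddim_sample(x_T, eps_model, alphas):
    """Walk from the inverted noise back to a clean latent (sampling)."""
    x = list(x_T)
    for t in range(len(alphas) - 1, 0, -1):
        eps = eps_model(x, t)
        x = _move(x, eps, alphas[t], alphas[t - 1])
    return x

def toy_eps(x, t):
    """Stand-in noise predictor that depends only on the timestep (not a real U-Net)."""
    return [0.1 * math.sin(t)] * len(x)

alphas = make_alphas(50)
source_latent = [0.5, -1.2, 2.0]
inverted_noise = ddim_invert(source_latent, toy_eps, alphas)
reconstruction = ddim_sample(inverted_noise, toy_eps, alphas)
```

Because the toy predictor ignores the sample itself, sampling from `inverted_noise` retraces the inversion path and recovers `source_latent` up to floating-point error. With a real, sample-dependent noise predictor the round trip is only approximate, which is still enough for the inverted noise to carry the appearance of the source.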

Paper Structure

This paper contains 16 sections, 18 equations, 9 figures, 6 tables, 1 algorithm.

Figures (9)

  • Figure 1: Results generated by different methods, given the source video (or image) and the reference skeletons. Red boxes highlight inconsistencies in appearance with the source video, including distortions and ghosting.
  • Figure 2: The overall pipeline of Edit-Your-Motion. We employ DDIM inversion to preserve the appearance of the source video and introduce a motion attention module to resolve conflicts between skeleton and appearance features. Additionally, we replace spatial attention with recurrent causal attention to enhance inter-frame connections. Finally, to improve the feature extraction capabilities of each module, we design a spatio-temporal decoupling two-stage training strategy that requires only a few training iterations.
  • Figure 3: The role of DDIM inversion. When passed directly through the U-Net, the noise obtained by DDIM inversion still retains most of the structural features of the source video.
  • Figure 4: The structure of the motion attention module. It consists of self-attention, cross-attention, and temporal attention layers, which mitigate the conflict between the skeleton and appearance features extracted by ControlNet and U-Net.
  • Figure 5: The structure of recurrent causal attention. It directly connects the previous frame to the next, thereby enhancing the consistency of the video.
  • ...and 4 more figures
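
The idea in the Figure 5 caption, that each frame is connected directly to the frame before it, can be sketched as a simplified previous-frame causal attention. This is an assumption-laden illustration rather than the paper's RCA: the function names are hypothetical, there are no learned query/key/value projections, and the rule that frame i attends to frame i-1 plus itself (with the first frame attending only to itself) is one simple reading of the caption.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, context):
    """Scaled dot-product attention with identity projections (no learned weights)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)])
    return out

def previous_frame_causal_attention(frames):
    """frames: list of frames, each a list of token vectors of equal dimension.

    Assumed rule: tokens of frame i attend to tokens of frame i-1 and frame i;
    frame 0 attends only to itself. No frame ever sees a later frame.
    """
    outputs = []
    for i, frame in enumerate(frames):
        context = frame if i == 0 else frames[i - 1] + frame
        outputs.append(attend(frame, context))
    return outputs

frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0]],
    [[0.0, 2.0], [2.0, 0.0]],
]
out = previous_frame_causal_attention(frames)
```

Under this rule, editing a later frame cannot alter the output of any earlier frame, and information from frame 0 reaches frame 2 only through stacked layers, not within a single attention call.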