MotionEditor: Editing Video Motion via Content-Aware Diffusion

Shuyuan Tu, Qi Dai, Zhi-Qi Cheng, Han Hu, Xintong Han, Zuxuan Wu, Yu-Gang Jiang

TL;DR

MotionEditor tackles the problem of editing video motion while preserving the original appearance and background. It adds a content-aware motion adapter to ControlNet, a high-fidelity attention injection mechanism within a two-branch reconstruction/editing framework, and a skeleton alignment step to ensure pose compatibility, all within a diffusion-based video editing pipeline. The approach improves motion fidelity and appearance preservation, as shown by qualitative and quantitative results and ablations. This enables more reliable, reference-guided motion editing of real-world videos with temporal consistency.

Abstract

Existing diffusion-based video editing models have made remarkable advances in editing attributes of a source video over time but struggle to manipulate the motion information while preserving the original protagonist's appearance and background. To address this, we propose MotionEditor, a diffusion model for video motion editing. MotionEditor incorporates a novel content-aware motion adapter into ControlNet to capture temporal motion correspondence. While ControlNet enables direct generation based on skeleton poses, it encounters challenges when modifying the source motion in the inverted noise due to contradictory signals between the noise (source) and the condition (reference). Our adapter complements ControlNet by involving source content to transfer adapted control signals seamlessly. Further, we build a two-branch architecture (a reconstruction branch and an editing branch) with a high-fidelity attention injection mechanism that facilitates branch interaction. This mechanism enables the editing branch to query the key and value from the reconstruction branch in a decoupled manner, so that the editing branch retains the original background and protagonist appearance. We also propose a skeleton alignment algorithm to address the discrepancies in pose size and position. Experiments demonstrate the promising motion editing ability of MotionEditor, both qualitatively and quantitatively.
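
The key/value injection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the plain-list token representation, and the boolean `keep` mask (a stand-in for the foreground-mask-guided decoupling) are all assumptions made for illustration. The sketch only shows the general idea of editing-branch queries attending over keys and values taken from the reconstruction branch.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[i] for e, v in zip(exps, values)) for i in range(dim)]

def injected_attention(edit_q, edit_k, edit_v, recon_k, recon_v, keep):
    """Editing-branch queries attend over their own tokens plus the
    reconstruction-branch tokens selected by `keep`.

    `keep` is a hypothetical per-token boolean mask; the paper's actual
    decoupling of foreground and background keys/values is not reproduced here.
    """
    extra_k = [k for k, m in zip(recon_k, keep) if m]
    extra_v = [v for v, m in zip(recon_v, keep) if m]
    keys = edit_k + extra_k
    values = edit_v + extra_v
    return [attend(q, keys, values) for q in edit_q]
```

When a query matches an injected reconstruction key much more strongly than its own keys, the output is dominated by the corresponding reconstruction value, which is how source appearance could carry over into the edited result in this simplified picture.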

Paper Structure

This paper contains 20 sections, 8 equations, 15 figures, 3 tables, 1 algorithm.

Figures (15)

  • Figure 1: MotionEditor: A diffusion-based video editing method aimed at transferring motion from a reference to a source.
  • Figure 2: Architecture overview of MotionEditor. In training, only the motion adapter and the temporal attention in the U-Net are trainable. In inference, we first align the source and reference skeletons through resizing and translation. We then build a two-branch framework: one branch for reconstruction and the other for editing. The motion adapter enhances the motion guidance of ControlNet by leveraging information from the source latent. We also inject the key/value from the reconstruction branch into the editing branch to preserve the source appearance.
  • Figure 3: Illustration of high-fidelity attention injection during inference. We leverage the source foreground masks to guide the decoupling of key/value in the Consistent-Sparse Attention.
  • Figure 4: Motion editing results of our MotionEditor. More examples can be found in the appendix.
  • Figure 5: Qualitative comparison between our MotionEditor and other state-of-the-art video editing models. Source prompt: "a girl in a black dress is dancing." Target prompt: "a girl in a black dress is practicing tai chi." Our method exhibits accurate motion editing and appearance preservation.
  • ...and 10 more figures
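
The Figure 2 caption mentions aligning the source and reference skeletons through resizing and translation. A minimal sketch of one way to do that follows. It assumes 2D keypoints given as `(x, y)` tuples and matches bounding-box height and center; the function names and the choice of bounding-box statistics are illustrative assumptions, not the paper's algorithm.

```python
def bbox(points):
    """Axis-aligned bounding box (x_min, y_min, x_max, y_max) of 2D keypoints."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)

def align_skeleton(reference, source):
    """Resize and translate `reference` keypoints so their bounding box
    matches the height and center of the `source` keypoints.

    Assumption: a single uniform scale from bounding-box height is enough.
    """
    rx0, ry0, rx1, ry1 = bbox(reference)
    sx0, sy0, sx1, sy1 = bbox(source)
    ref_height = ry1 - ry0
    scale = (sy1 - sy0) / ref_height if ref_height else 1.0
    ref_cx, ref_cy = (rx0 + rx1) / 2, (ry0 + ry1) / 2
    src_cx, src_cy = (sx0 + sx1) / 2, (sy0 + sy1) / 2
    # Scale about the reference center, then move to the source center.
    return [((x - ref_cx) * scale + src_cx, (y - ref_cy) * scale + src_cy)
            for x, y in reference]
```

For example, `align_skeleton([(0, 0), (2, 4)], [(10, 10), (14, 18)])` doubles the reference pose and recenters it on the source pose.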