MotionFollower: Editing Video Motion via Lightweight Score-Guided Diffusion

Shuyuan Tu, Qi Dai, Zihao Zhang, Sicheng Xie, Zhi-Qi Cheng, Chong Luo, Xintong Han, Zuxuan Wu, Yu-Gang Jiang

TL;DR

MotionFollower addresses the challenge of editing video motion while preserving background details and appearance by introducing two lightweight, convolution-based controllers for pose and appearance, avoiding heavy attention mechanisms. A novel score-guided inference with a two-branch architecture and segmentation-based regularizers enforces regional consistency between reconstruction and editing branches, steering denoising without updating model weights. The method achieves competitive motion editing performance with ~80% GPU memory reduction compared to MotionEditor and demonstrates robustness to long sequences and large camera movements. Extensive experiments, ablations, and qualitative/quantitative comparisons validate the approach and highlight its efficiency and effectiveness for motion-centric video editing.

Abstract

Despite impressive advances by diffusion-based video editing models in altering video attributes, there has been limited exploration into modifying motion information while preserving the original protagonist's appearance and background. In this paper, we propose MotionFollower, a lightweight score-guided diffusion model for video motion editing. To introduce conditional controls to the denoising process, MotionFollower leverages two of our proposed lightweight signal controllers, one for poses and the other for appearances, both of which consist of convolution blocks without involving heavy attention calculations. Further, we design a score guidance principle based on a two-branch architecture, including the reconstruction and editing branches, which significantly enhances the modeling capability of texture details and complicated backgrounds. Concretely, we enforce several consistency regularizers and losses during the score estimation. The resulting gradients thus inject appropriate guidance into the intermediate latents, forcing the model to preserve the original background details and protagonists' appearances without interfering with the motion modification. Experiments demonstrate the competitive motion editing ability of MotionFollower qualitatively and quantitatively. Compared with MotionEditor, the most advanced motion editing model, MotionFollower achieves an approximately 80% reduction in GPU memory while delivering superior motion editing performance and exclusively supporting large camera movements and actions.
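
To make the idea of score guidance concrete, the following is a minimal sketch of a single guidance step on toy arrays. It is not the paper's implementation: the actual method computes several regularizers on features from the two branches using segmentation masks, whereas this sketch assumes a single squared-error consistency loss applied directly to the latents. The names `consistency_loss`, `guidance_step`, `step_size`, and `bg_mask` are hypothetical and chosen for illustration only. The point it shows is that the gradient of a masked loss changes the editing latent only inside the masked region, and that no model weights are updated.

```python
import numpy as np

def consistency_loss(z_edit, z_recon, mask):
    """Toy loss (an assumption): half the squared difference inside `mask`."""
    return 0.5 * float(np.sum(mask * (z_edit - z_recon) ** 2))

def guidance_step(z_edit, z_recon, mask, step_size=0.5):
    """One gradient step on the editing latent; no weights are involved."""
    # Analytic gradient of consistency_loss with respect to z_edit.
    grad = mask * (z_edit - z_recon)
    return z_edit - step_size * grad

rng = np.random.default_rng(0)
z_recon = rng.normal(size=(4, 8, 8))  # latent from the reconstruction branch
z_edit = rng.normal(size=(4, 8, 8))   # latent from the editing branch

# Hypothetical background mask: left half of the frame is background.
bg_mask = np.zeros((1, 8, 8))
bg_mask[:, :, :4] = 1.0

loss_before = consistency_loss(z_edit, z_recon, bg_mask)
z_guided = guidance_step(z_edit, z_recon, bg_mask)
loss_after = consistency_loss(z_guided, z_recon, bg_mask)
```

In this toy setting the masked region of the editing latent moves toward the reconstruction latent while the unmasked region, where the new motion would be placed, is left untouched.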

Paper Structure

This paper contains 22 sections, 13 equations, 18 figures, 4 tables, 1 algorithm.

Figures (18)

  • Figure 1: MotionFollower: a lightweight motion editing method for transferring motion from a target video to a source video while keeping the source's background, protagonists' appearance, and camera movement.
  • Figure 2: Architecture overview. During training, the two lightweight signal controllers and the U-Net are trainable. The model is first trained on single frames (first stage), then on video clips (second stage). At inference, we build a two-branch structure, one branch for reconstruction and the other for editing. Score guidance is computed from the features of the two branches and is then used to update the latent.
  • Figure 3: Qualitative comparison between our MotionFollower and other state-of-the-art models. Our method exhibits accurate motion editing and appearance preservation.
  • Figure 4: Example illustration of ablation study on core components of the proposed MotionFollower.
  • Figure 5: The overview of our proposed person segmentation model.
  • ...and 13 more figures