Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models

Yixuan Ren, Yang Zhou, Jimei Yang, Jing Shi, Difan Liu, Feng Liu, Mingi Kwon, Abhinav Shrivastava

TL;DR

This work tackles one-shot video motion customization for text-to-video diffusion models by learning a motion signature from a single reference video. It introduces Temporal LoRA to model temporal dynamics and Appearance Absorbers to disentangle spatial appearance from motion, employing a staged training/inference pipeline for robust, diverse motion transfer to new subjects and scenes. Experiments show faithful motion reproduction and rich variation, outperforming baselines and concurrent methods on both quantitative metrics and human judgments. The framework supports downstream tasks such as video appearance customization, multiple motion combination, and reuse of third-party absorbers, offering a plug-and-play approach to motion-aware video generation and editing.

Abstract

Image customization has been extensively studied in text-to-image (T2I) diffusion models, leading to impressive outcomes and applications. With the emergence of text-to-video (T2V) diffusion models, its temporal counterpart, motion customization, has not yet been well investigated. To address the challenge of one-shot video motion customization, we propose Customize-A-Video that models the motion from a single reference video and adapts it to new subjects and scenes with both spatial and temporal varieties. It leverages low-rank adaptation (LoRA) on temporal attention layers to tailor the pre-trained T2V diffusion model for specific motion modeling. To disentangle the spatial and temporal information during training, we introduce a novel concept of appearance absorbers that detach the original appearance from the reference video prior to motion learning. The proposed modules are trained in a staged pipeline and inferred in a plug-and-play fashion, enabling easy extensions to various downstream tasks such as custom video generation and editing, video appearance customization and multiple motion combination. Our project page can be found at https://customize-a-video.github.io.
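
As an illustration of the low-rank adaptation idea mentioned in the abstract, here is a minimal pure-Python sketch. It is not the authors' implementation. It assumes the standard LoRA formulation, in which a frozen weight W is augmented with a trainable update (alpha / r) * B A, with B initialized to zero so that the adapted layer initially behaves exactly like the base layer. The class name `LoRALinear`, the helper functions, and the toy 4x4 weight are hypothetical and chosen only for this example. In the paper's setting the adapted weights would be projections inside the temporal attention layers of the T2V model.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # pre-trained weight, kept frozen
        self.scale = alpha / rank     # assumed standard LoRA scaling
        # A starts random and B starts at zero, so the update is initially zero.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        low_rank = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, low_rank)]

    def num_trainable(self):
        """Number of parameters in A and B (the only trainable ones)."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Toy usage: a 4x4 identity "pre-trained" weight with a rank-1 adapter.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LoRALinear(W, rank=1)
x = [1.0, 2.0, 3.0, 4.0]
print(layer.forward(x))        # equals the base output while B is zero
print(layer.num_trainable())   # 8 adapter parameters vs. 16 in the full weight
```

Because only A and B are trained, a motion adapter of this kind is small and can be loaded or removed without touching the base weights, which is what makes the plug-and-play usage described above plausible.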

Paper Structure (50 sections, 6 equations, 12 figures, 3 tables)

Figures (12)

  • Figure 1: Customize-A-Video takes a single reference video (top left) as input and transfers its motion onto newly generated videos with plausible variance. (1) Transferring the dancing twirl from the lady onto Ironman, with two random output variants. (2) Transferring the motion onto multiple subjects. (3) Combining multiple motion customizations, i.e., both the dancing twirl and an aerial camera flyover. (4) Combining the proposed motion customization with existing image customization methods (Kappa_Neuro in the example) to support both appearance and motion customization.
  • Figure 2: Our Temporal LoRA (T-LoRA), Appearance Absorbers, and their training and inference processes. All noise and denoising schedules are omitted for simplicity. (1) We bypass all temporal layers in a base T2V diffusion model and apply an appearance absorber, such as a Spatial LoRA (S-LoRA) or Textual Inversion, on its spatial attention layers. The module is trained on unordered video frames. (2) We apply T-LoRA on all temporal attention layers in the full base T2V model. The trained appearance absorber is also loaded and frozen. The module is trained on the target video. (3) During inference, only the trained T-LoRA is loaded. A new video with the customized motion is generated from a prompt describing the new appearance and the target motion.
  • Figure 3: Results of one-shot motion customization. (1-left) Reference video. (2-left) ModelScope [wang2023modelscope] fails to transfer the reference motion faithfully with only text guidance. (1-right & 2-right) Tune-A-Video [wu2023tune] and Video-P2P [liu2023video] rely on DDIM-inverted latent input and duplicate the original frame structure deterministically. (3) The concurrent work MotionDirector [zhao2023motiondirector] also generates varied outputs that follow the reference motion, but shows some appearance and motion artifacts, especially on hard examples with complex or intensive movements. (4) Our method generates motion with both accuracy and variety in details such as view perspective and frame layout. Two variants generated with random noise are shown for MotionDirector and Ours.
  • Figure 4: Left: Ablations on applying LoRAs to different attention layers. S-LoRA memorizes the indoor furniture and wall decorations, while T-LoRA converts paintings to entrances and sofas to pool benches. Right: Ablations on training T-LoRAs with different types of appearance absorbers (AA). No AA adds stylish glasses and the logo but retains most of the original appearance. S-LoRA AA and TextInv AA significantly boost the quality but still leave the strips on the wall and the partially white sleeves. Dual AA achieves the best spatial clearance, with clear costume and background.
  • Figure 5: Left: Video appearance customization with both T-LoRA and an existing pre-trained S-LoRA (from tungdop2). Right: Multiple motion combination with two T-LoRAs loaded at the same time. While the robot jogs slowly, it rapidly zooms in as the background trees rapidly zoom out (dolly zoom).
  • ...and 7 more figures
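
The three stages described in the Figure 2 caption can be summarized as a small table of which modules are trained, which are loaded frozen, and which data each stage sees. The sketch below is only a bookkeeping aid under stated assumptions, not the authors' code: the `Stage` dataclass, the module labels `appearance_absorber` and `t_lora`, and the helper `loaded_modules` are hypothetical names, and the base T2V model is assumed to stay frozen throughout (as is usual with LoRA).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    temporal_layers_active: bool   # False means temporal layers are bypassed
    trainable: tuple               # modules updated in this stage
    frozen_loaded: tuple           # modules loaded but not updated
    data: str                      # what the stage consumes


# Stage descriptions paraphrase the Figure 2 caption.
PIPELINE = (
    Stage("1: train appearance absorber", False,
          ("appearance_absorber",), (),
          "unordered video frames"),
    Stage("2: train T-LoRA", True,
          ("t_lora",), ("appearance_absorber",),
          "target video"),
    Stage("3: inference", True,
          (), ("t_lora",),
          "prompt describing the new appearance and the target motion"),
)


def loaded_modules(stage):
    """All add-on modules present on top of the base model in a stage."""
    return set(stage.trainable) | set(stage.frozen_loaded)


for stage in PIPELINE:
    print(stage.name, "->", sorted(loaded_modules(stage)))
```

Read this way, the absorber exists only to soak up the reference appearance during stage 2 and is dropped at inference, so the T-LoRA alone carries the motion to new subjects and scenes.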