VideoMage: Multi-Subject and Motion Customization of Text-to-Video Diffusion Models

Chi-Pin Huang, Yen-Siang Wu, Hung-Kai Chung, Kai-Po Chang, Fu-En Yang, Yu-Chiang Frank Wang

TL;DR

VideoMage addresses the challenge of jointly customizing multiple subjects and their interactive motions in text-to-video diffusion models. It introduces subject and motion LoRAs, an appearance-agnostic motion learning objective that uses negative guidance to disentangle motion from appearance, and a spatial-temporal collaborative sampling framework that fuses multi-subject information with motion patterns. The approach yields coherent videos in which subjects keep consistent identities and follow complex interactions, outperforming prior single-subject motion methods on both qualitative and quantitative criteria, including user studies. The framework enables practical, controllable multi-subject video generation with explicit handling of cross-subject interactions in dynamic scenes.

Abstract

Customized text-to-video generation aims to produce high-quality videos that incorporate user-specified subject identities or motion patterns. However, existing methods mainly focus on personalizing a single concept, either subject identity or motion pattern, which limits their effectiveness when multiple subjects must follow desired motion patterns. To tackle this challenge, we propose a unified framework, VideoMage, for video customization over both multiple subjects and their interactive motions. VideoMage employs subject and motion LoRAs to capture personalized content from user-provided images and videos, along with an appearance-agnostic motion learning approach to disentangle motion patterns from visual appearance. Furthermore, we develop a spatial-temporal composition scheme to guide interactions among subjects within the desired motion patterns. Extensive experiments demonstrate that VideoMage outperforms existing methods, generating coherent, user-controlled videos with consistent subject identities and interactions.

Paper Structure

This paper contains 42 sections, 11 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: Overview of VideoMage. (a) Given images of multiple subjects and a reference video with the desired motion, VideoMage employs LoRAs to capture the visual appearances and the appearance-agnostic motion information, respectively. (b) With a text prompt relating the aforementioned visual and motion concepts, our spatial-temporal collaborative composition refines the input noisy latent $x_t$ to generate videos matching the desired visual and motion information.
  • Figure 2: Appearance-agnostic motion learning. By using a text prompt that emphasizes the appearance information (i.e., $c_{\text{ap}}$), we aim to extract appearance-agnostic motion information via the proposed negative classifier-free guidance.
  • Figure 3: Spatial-temporal collaborative composition for T2V test-time optimization. (a) Test-time fusion of subject LoRAs $\hat{\theta}_s$, which employs attention regularization $\mathcal{L}_{attn}$ to ensure appearance preservation of each visual subject. (b) Spatiotemporal Collaborative Sampling (SCS) integrates the fused subject LoRA $\hat{\theta}_s$ and the motion LoRA $\theta_m$ by cross-modal alignment, ensuring visual and temporal coherence.
  • Figure 4: Qualitative comparisons of different customization methods. The subject images and the reference motion video are listed at the top of the figure. DV and MD refer to DreamVideo wei2024dreamvideo and MotionDirector zhao2023motiondirector, respectively. Please refer to the supplementary materials for the complete input prompts used for customization (e.g., describing the background, etc.).
  • Figure 5: Human preference study. Our VideoMage consistently achieves the best human preference compared to DreamVideo wei2024dreamvideo and MotionDirector wu2024motionbooth.
  • ...and 3 more figures