VideoMage: Multi-Subject and Motion Customization of Text-to-Video Diffusion Models
Chi-Pin Huang, Yen-Siang Wu, Hung-Kai Chung, Kai-Po Chang, Fu-En Yang, Yu-Chiang Frank Wang
TL;DR
VideoMage addresses the challenge of jointly customizing multiple subjects and their interactive motions in text-to-video diffusion. It introduces subject and motion LoRAs, an appearance-agnostic motion learning objective that uses negative guidance to disentangle motion from appearance, and a spatial-temporal collaborative sampling framework that fuses multi-subject information with motion patterns. The approach yields coherent videos in which subjects keep consistent identities while following complex interactions, and it outperforms prior methods that customize only a single concept (subject or motion) in qualitative comparisons, quantitative metrics, and user studies. The framework enables controllable multi-subject video generation with explicit handling of cross-subject interactions in dynamic scenes.
Abstract
Customized text-to-video generation aims to produce high-quality videos that incorporate user-specified subject identities or motion patterns. However, existing methods mainly focus on personalizing a single concept, either subject identity or motion pattern, which limits their effectiveness when multiple subjects must be combined with desired motion patterns. To tackle this challenge, we propose VideoMage, a unified framework for video customization over both multiple subjects and their interactive motions. VideoMage employs subject and motion LoRAs to capture personalized content from user-provided images and videos, along with an appearance-agnostic motion learning approach to disentangle motion patterns from visual appearance. Furthermore, we develop a spatial-temporal composition scheme to guide interactions among subjects within the desired motion patterns. Extensive experiments demonstrate that VideoMage outperforms existing methods, generating coherent, user-controlled videos with consistent subject identities and interactions.
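
For readers unfamiliar with LoRA, the following is a minimal sketch of the generic low-rank update that subject and motion LoRAs build on: a frozen weight W is adapted as W + (alpha / r) * B A, where only the small matrices A and B are trained. This illustrates standard LoRA only and is not the VideoMage implementation; the function names, the scaling factor alpha, and the toy shapes are assumptions chosen for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha):
    """Return the LoRA-adapted weight W + (alpha / r) * B @ A.

    w: frozen base weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    alpha: scaling hyperparameter (hypothetical value in the usage below)
    """
    r = len(a)                      # rank of the low-rank update
    delta = matmul(b, a)            # low-rank update, shape (d_out, d_in)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy usage: a 2x2 base weight with a rank-1 adapter.
base = [[1.0, 0.0], [0.0, 1.0]]
down = [[1.0, 2.0]]                 # A, shape (1, 2)
up_zero = [[0.0], [0.0]]            # B initialised to zero: no change to base
up_trained = [[1.0], [0.5]]         # B after some hypothetical training

unchanged = lora_weight(base, down, up_zero, alpha=2.0)
adapted = lora_weight(base, down, up_trained, alpha=2.0)
```

With B initialised to zero the adapted weight equals the base weight, which is why LoRA training starts from the pretrained model's behaviour. Keeping a separate (A, B) pair per concept is what allows subject and motion information to be stored in separate adapters.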
