Text Prompting for Multi-Concept Video Customization by Autoregressive Generation
Divya Kothandaraman, Kihyuk Sohn, Ruben Villegas, Paul Voigtlaender, Dinesh Manocha, Mohammad Babaeizadeh
TL;DR
This work tackles multi-concept video customization of pretrained text-to-video models by teaching the model how concepts interact through autoregressive generation. It finetunes a single adapter jointly on all concepts, then generates causally, one subject at a time, using carefully structured prompts that steer the output toward the intersection of the concepts' video manifolds. The approach improves fidelity in single-, two-, and three-concept scenarios, as measured by quantitative metrics (VideoCLIP, DINO) and human evaluation, and shows advantages over DreamBooth-style finetuning and other baselines. The method enables compositional, controllable video generation from limited data per concept, and points to future extensions toward higher resolutions, longer horizons, and richer interaction controls.
Abstract
We present a method for multi-concept customization of pretrained text-to-video (T2V) models. Intuitively, a multi-concept customized video can be derived from the (non-linear) intersection of the video manifolds of the individual concepts, which is not straightforward to find. We hypothesize that a sequential and controlled walk towards the intersection of the video manifolds, directed by text prompting, leads to the solution. To do so, we generate the various concepts and their corresponding interactions sequentially, in an autoregressive manner. Our method can generate videos of multiple custom concepts (subjects, actions, and backgrounds), such as a teddy bear running towards a brown teapot, a dog playing violin, and a teddy bear swimming in the ocean. We quantitatively evaluate our method using VideoCLIP and DINO scores, in addition to human evaluation. Videos for the results presented in this paper can be found at https://github.com/divyakraman/MultiConceptVideo2024.
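The sequential, prompt-directed generation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt template ("a video of ..."), the function names `build_staged_prompts` and `generate_autoregressively`, and the `stub_generate` placeholder standing in for a customized T2V model are all assumptions made for illustration. The sketch only shows the control flow, in which subjects are introduced one at a time and each stage is conditioned on the frames produced by the previous stage.

```python
from typing import Callable, List, Optional, Sequence

# Stand-in for video frames; a real system would use tensors (assumption).
Frames = List[str]
# Hypothetical generator signature: (prompt, previous frames or None) -> new frames.
GenerateFn = Callable[[str, Optional[Frames]], Frames]


def build_staged_prompts(subjects: Sequence[str], interaction: str) -> List[str]:
    """Return one prompt per stage: add one subject at a time, then the interaction.

    The "a video of ..." template is an illustrative assumption.
    """
    if not subjects:
        raise ValueError("at least one subject is required")
    prompts = [
        "a video of " + " and ".join(subjects[:i])
        for i in range(1, len(subjects) + 1)
    ]
    prompts.append(interaction)
    return prompts


def generate_autoregressively(prompts: Sequence[str], generate: GenerateFn) -> Frames:
    """Run the stages in order, conditioning each stage on the previous stage's frames."""
    context: Optional[Frames] = None
    video: Frames = []
    for prompt in prompts:
        chunk = generate(prompt, context)  # causal: sees only earlier frames
        video.extend(chunk)
        context = chunk
    return video


def stub_generate(prompt: str, context: Optional[Frames]) -> Frames:
    """Placeholder for a customized T2V model; returns one labeled 'frame' per call."""
    return [f"frame:{prompt}"]


prompts = build_staged_prompts(
    ["a teddy bear", "a brown teapot"],
    "a teddy bear running towards a brown teapot",
)
video = generate_autoregressively(prompts, stub_generate)
```

Replacing `stub_generate` with a call into an actual finetuned model would be the natural next step; how the paper conditions each stage on earlier frames is not specified in this section, so the `context` argument here is only a placeholder for that mechanism.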
