Text Prompting for Multi-Concept Video Customization by Autoregressive Generation

Divya Kothandaraman, Kihyuk Sohn, Ruben Villegas, Paul Voigtlaender, Dinesh Manocha, Mohammad Babaeizadeh

TL;DR

This work tackles multi-concept video customization with pretrained text-to-video models by teaching the model the interactions among concepts through autoregressive generation. It finetunes a single adapter jointly on all concepts and then performs causal, one-subject-at-a-time generation using carefully structured prompts to guide interactions toward the intersection of concept video manifolds. The approach yields improved fidelity for multi-subject interactions across two-, three-, and single-concept scenarios, demonstrated with quantitative metrics (VideoCLIP, DINO) and human evaluations, and shows advantages over DreamBooth-style finetuning and baselines. The method enables compositional, controllable video generation with limited data per concept and points to future extensions toward higher resolutions, longer horizons, and richer interaction controls.

Abstract

We present a method for multi-concept customization of pretrained text-to-video (T2V) models. Intuitively, the multi-concept customized video can be derived from the (non-linear) intersection of the video manifolds of the individual concepts, which is not straightforward to find. We hypothesize that sequential and controlled walking towards the intersection of the video manifolds, directed by text prompting, leads to the solution. To do so, we generate the various concepts and their corresponding interactions sequentially, in an autoregressive manner. Our method can generate videos of multiple custom concepts (subjects, actions, and backgrounds), such as a teddy bear running towards a brown teapot, a dog playing violin, and a teddy bear swimming in the ocean. We quantitatively evaluate our method using VideoCLIP and DINO scores, in addition to human evaluation. Videos for results presented in this paper can be found at https://github.com/divyakraman/MultiConceptVideo2024.

Paper Structure (26 sections, 1 equation, 16 figures)

Figures (16)

  • Figure 1: We present a method for multi-concept customization of text-to-video models. Our method relies only on a pre-trained text-to-video model and concept-specific data in the form of image(s) or video(s). In the illustration above, our model uses an image of a teddy bear and an ocean background (top row) and an image of a dog and a video of playing violin (bottom row) to generate the customized videos.
  • Figure 2: Proposed method for multi-concept video customization. First, we add adapter layers to the transformer architecture in an autoregressive T2V model and finetune these additional layers on the given images or videos associated with the $N$ custom concepts and their corresponding text prompts. The goal is to find the solution at the intersection of the video manifolds corresponding to the various custom concepts. Then, we condition on $m$ ($=5$) prior frames and sequentially generate the custom concepts and their interactions in a controlled manner using a set of prompts $p_{0}, \ldots, p_{N}$. The prompts $p_{0}, \ldots, p_{N}$ are designed to represent the scene in a top-down manner, each prompt adding a custom concept and the associated interaction.
  • Figure 3: Two-subject customization. While Phenaki contains some prior knowledge of the interactions between different kinds of subjects, the finetuned model is unable to generate the interactions between the custom subjects. Through controlled and sequential autoregressive generation of concepts and their interactions, our method is able to generate customized videos with two subjects interacting with each other.
  • Figure 4: Subject-action customization. Phenaki, due to a lack of prior knowledge, is unable to generate certain actions. The finetuned model is able to generate the custom subject, but not the precise action. Generating the motion corresponding to the action first, and then generating the custom subject performing the action (conditioned on the motion), enables the generation of the custom subject performing the custom action. Thus, we are not only able to 'teach' the action to the model, but are also able to customize the subject performing the action.
  • Figure 5: Subject-background customization. While subject-background customization is relatively easy for the Phenaki model, the finetuned model is unable to emphasize the background due to bias issues. This is resolved by our model, which first generates the background and then generates video frames of the custom subject performing the action, conditioned on the background.
  • ...and 11 more figures
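
The sequential generation described in the Figure 2 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `autoregressive_customization` and `toy_generate_chunk`, the string-valued `Frame` type, the chunk length, and the two example prompts are all assumptions made for illustration. Only the overall loop (one prompt per step, each step conditioned on the last $m = 5$ frames) follows the caption; a real system would call the finetuned T2V model where the toy generator is used here.

```python
from typing import Callable, List, Sequence

# Placeholder frame type (assumption); a real model would use image tensors.
Frame = str


def autoregressive_customization(
    prompts: Sequence[str],
    generate_chunk: Callable[[str, List[Frame]], List[Frame]],
    m: int = 5,
) -> List[Frame]:
    """Generate one chunk of frames per prompt, in order.

    Each chunk is conditioned on at most the last `m` frames generated so far,
    so later prompts build on the concepts introduced by earlier ones.
    """
    video: List[Frame] = []
    for prompt in prompts:
        context = video[-m:]  # empty list for the first prompt
        video.extend(generate_chunk(prompt, context))
    return video


def toy_generate_chunk(
    prompt: str, context: List[Frame], chunk_len: int = 3
) -> List[Frame]:
    """Hypothetical stand-in for the finetuned T2V model.

    It only records the prompt and how many context frames it received.
    """
    return [f"{prompt}|ctx={len(context)}|t={t}" for t in range(chunk_len)]


# Hypothetical top-down prompt sequence: each prompt adds one concept
# and its interaction with what is already in the scene.
prompts = [
    "a brown teapot on a table",
    "a teddy bear running towards a brown teapot",
]

video = autoregressive_customization(prompts, toy_generate_chunk, m=5)
for frame in video:
    print(frame)
```

In this toy run the first chunk receives no context, while the second chunk is conditioned on the three frames produced for the first prompt; with longer chunks the context would be capped at the last five frames.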