Multi-subject Open-set Personalization in Video Generation

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Yuwei Fang, Kwot Sin Lee, Ivan Skorokhodov, Kfir Aberman, Jun-Yan Zhu, Ming-Hsuan Yang, Sergey Tulyakov

TL;DR

This work tackles open-set, multi-subject video personalization without per-subject test-time optimization by introducing Video Alchemist, a latent Diffusion Transformer that fuses text prompts with per-subject reference images through subject-level embeddings and dual cross-attention. It addresses dataset and evaluation challenges with an automatic data construction pipeline and the MSRVTT-Personalization benchmark, respectively. Through data augmentation and careful binding of image and word concepts, it reduces overfitting and copy-paste artifacts while delivering high subject fidelity and natural motion. Empirical results show superior performance over existing methods across quantitative metrics and human studies, highlighting the practicality of open-set personalized video generation across diverse contexts.

Abstract

Video personalization methods allow us to synthesize videos with specific concepts such as people, pets, and places. However, existing methods often focus on limited domains, require time-consuming optimization per subject, or support only a single subject. We present Video Alchemist, a video model with built-in multi-subject, open-set personalization capabilities for both foreground objects and background, eliminating the need for time-consuming test-time optimization. Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt with cross-attention layers. Developing such a large model presents two main challenges: dataset and evaluation. First, as paired datasets of reference images and videos are extremely hard to collect, we sample selected video frames as reference images and synthesize a clip of the target video. However, while models can easily denoise training videos given reference frames, they fail to generalize to new contexts. To mitigate this issue, we design a new automatic data construction pipeline with extensive image augmentations. Second, evaluating open-set video personalization is a challenge in itself. To address this, we introduce a personalization benchmark that focuses on accurate subject fidelity and supports diverse personalization scenarios. Finally, our extensive experiments show that our method significantly outperforms existing personalization methods in both quantitative and qualitative evaluations.

Paper Structure

This paper contains 24 sections, 2 equations, 17 figures, and 6 tables.

Figures (17)

  • Figure 1: Dataset collection pipeline for video personalization. We construct our training dataset using video and caption pairs through three steps. First, we identify three categories of entity words from the caption: subject, object, and background. Next, we use these entity words to localize and segment the target subjects and objects in three selected video frames. Finally, we extract a clean background image by removing the subjects and objects from the middle frame.
  • Figure 2: Model architecture. Our model is a latent DiT, where we first encode a video into video tokens and denoise them with a deep cascade of DiT blocks in the latent space. Each DiT block includes an additional cross-attention operation with personalization embeddings $f = \textrm{Concat}(f_1, \dots, f_n, \dots, f_N)$, where $f_n$ fuses the embeddings of both the reference image $x_n$ and the corresponding entity word $c_n$. Each square in the figure represents a 1-D token.
  • Figure 3: Test sample in MSRVTT-Personalization. We present a comprehensive video personalization benchmark. Our benchmark supports various modes, including face conditioning, single or multiple subjects conditioning, and foreground and background conditioning.
  • Figure 4: Qualitative comparison on MSRVTT-Personalization. We provide each model with a single reference image for a fair comparison. Compared to existing methods, our results closely match the input text prompt and reference subjects while exhibiting natural motion and pose variations.
  • Figure 5: Qualitative results of the ablation study. From top to bottom, we show that 1) Video Alchemist achieves better subject fidelity using DINOv2 as the image encoder; 2) it correctly binds the conditional image and entity word through the use of word tokens; 3) it mitigates the copy-and-paste effect and synthesizes text-aligned videos via the proposed data augmentation. The reference image is synthesized by DALL·E 3.
  • ...and 12 more figures
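
The personalization cross-attention described in the Figure 2 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the caption only states that $f_n$ fuses the reference-image and entity-word embeddings and that $f$ is their concatenation. The fusion rule used here (tile the word embedding onto every image token, then apply a linear projection), the helper names `fuse_subject` and `cross_attention`, the single attention head, and all dimensions are assumptions chosen for illustration.

```python
import numpy as np


def fuse_subject(image_tokens, word_embedding, w_fuse):
    """Hypothetical fusion for f_n: attach the entity-word embedding c_n to
    every image token of x_n, then project to the model width."""
    num_tokens = image_tokens.shape[0]
    tiled_word = np.broadcast_to(word_embedding, (num_tokens, word_embedding.shape[0]))
    return np.concatenate([image_tokens, tiled_word], axis=-1) @ w_fuse


def cross_attention(video_tokens, context, w_q, w_k, w_v):
    """Single-head cross-attention: video tokens query the context tokens."""
    q = video_tokens @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
d_img, d_txt, d_model = 8, 4, 6          # illustrative sizes only
n_subjects, tokens_per_image, n_video_tokens = 3, 5, 10

w_fuse = rng.normal(size=(d_img + d_txt, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

# One (image tokens, entity-word embedding) pair per subject; random stand-ins
# for the outputs of an image encoder and a text encoder.
subjects = [
    (rng.normal(size=(tokens_per_image, d_img)), rng.normal(size=d_txt))
    for _ in range(n_subjects)
]

# f = Concat(f_1, ..., f_N) along the token axis.
f = np.concatenate([fuse_subject(x, c, w_fuse) for x, c in subjects], axis=0)

video_tokens = rng.normal(size=(n_video_tokens, d_model))
update, weights = cross_attention(video_tokens, f, w_q, w_k, w_v)
video_tokens = video_tokens + update     # residual connection, as in a transformer block
```

Because the per-subject embeddings are concatenated along the token axis, each video token attends over all subjects at once, and the number of subjects can vary without changing any weight shapes.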