Movie Weaver: Tuning-Free Multi-Concept Video Personalization with Anchored Prompts

Feng Liang, Haoyu Ma, Zecheng He, Tingbo Hou, Ji Hou, Kunpeng Li, Xiaoliang Dai, Felix Juefei-Xu, Samaneh Azadi, Animesh Sinha, Peizhao Zhang, Peter Vajda, Diana Marculescu

TL;DR

Movie Weaver performs tuning-free multi-concept video personalization. It addresses identity blending by explicitly linking each concept description to its corresponding reference image via anchored prompts, and by encoding the order of reference images with concept embeddings. A data-curation pipeline assembles a diverse 230K-video dataset across reference configurations, which is used to train the model starting from a single-face baseline. Empirical results show better identity preservation and visual quality than baselines such as Vidu 1.5, and ablations confirm the contribution of anchored prompts, concept embeddings, and mixed training. The approach allows flexible composition of face, body, and animal references while preserving distinct identities, with potential for practical personalized video generation.

Abstract

Video personalization, which generates customized videos using reference images, has gained significant attention. However, prior methods typically focus on single-concept personalization, limiting broader applications that require multi-concept integration. Attempts to extend these models to multiple concepts often lead to identity blending, which results in composite characters with fused attributes from multiple sources. This challenge arises due to the lack of a mechanism to link each concept with its specific reference image. We address this with anchored prompts, which embed image anchors as unique tokens within text prompts, guiding accurate referencing during generation. Additionally, we introduce concept embeddings to encode the order of reference images. Our approach, Movie Weaver, seamlessly weaves multiple concepts, including face, body, and animal images, into one video, allowing flexible combinations in a single model. The evaluation shows that Movie Weaver outperforms existing methods for multi-concept video personalization in identity preservation and overall quality.
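
As a rough illustration of the anchored-prompt idea described in the abstract, the sketch below inserts a unique anchor token after each concept description in a text prompt, so that the i-th description points at the i-th reference image. The token format `<ref_i>`, the function name `build_anchored_prompt`, and the simple string matching are assumptions made for this example and are not taken from the paper.

```python
def build_anchored_prompt(prompt, concept_descriptions):
    """Insert a unique anchor token after each concept description.

    `concept_descriptions` must be ordered like the reference images:
    the i-th description is tied to the i-th image via the token <ref_i>.
    The <ref_i> format is hypothetical, chosen only for this sketch.
    """
    anchored = prompt
    anchors = []
    for index, description in enumerate(concept_descriptions):
        anchor = f"<ref_{index}>"
        # Locate the first occurrence of the description in the prompt.
        start = anchored.find(description)
        if start == -1:
            raise ValueError(f"description not found in prompt: {description!r}")
        end = start + len(description)
        # Place the anchor directly after the description it refers to.
        anchored = anchored[:end] + " " + anchor + anchored[end:]
        anchors.append(anchor)
    return anchored, anchors


if __name__ == "__main__":
    text = "A woman in wheelchair discussing with a woman nurse."
    result, tokens = build_anchored_prompt(
        text, ["A woman in wheelchair", "a woman nurse"]
    )
    print(result)
    print(tokens)
```

With the two-woman prompt from Figure 3, each woman receives her own anchor, which is the kind of explicit link between description and reference image that the abstract identifies as missing in single-concept models.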

Paper Structure

This paper contains 37 sections, 2 equations, 10 figures, and 3 tables.

Figures (10)

  • Figure 1: We introduce Movie Weaver, a video diffusion model for personalized multi-concept video creation. Besides text prompts, our model allows users to input different combinations of reference images, e.g., face, body, and animal images, to customize videos in a tuning-free manner. The left column displays different types of reference images, while the right column shows the generated videos, with the anchored prompt listed beneath each video. We encourage readers to check our video results in the supplementary materials.
  • Figure 2: Single-concept personalization architecture. Building on a pre-trained text-to-video model, this approach adds an image encoder to process reference images. Image and text tokens are concatenated and fed into the cross-attention layer.
  • Figure 3: Identity blending generates composite faces with characteristics from both references. Text prompt: "A woman in wheelchair discussing with a woman nurse."
  • Figure 4: (a) Data curation. For a video-text pair, ① concept descriptions and anchored prompts are generated via in-context learning with Llama-3. After ② extracting body masks, ③ CLIP links each concept to its corresponding image. ④ Finally, face images are obtained using a face segmentation model. (b) Movie Weaver architecture. Compared to the single-concept baseline, reference images are arranged in a specific order for concept embedding, and anchored prompts are utilized. Shared components are omitted for simplicity.
  • Figure 5: Qualitative results of Movie Weaver. Movie Weaver supports different combinations of reference images and can generate high-quality videos with high identity preservation. We encourage readers to check our video results in the supplementary materials.
  • ...and 5 more figures
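
The captions for Figures 2 and 4(b) describe reference-image tokens being concatenated with text tokens for cross-attention, with reference images arranged in a fixed order for concept embedding. A minimal sketch of that data flow follows, using plain Python lists in place of tensors. It assumes that one embedding vector per reference slot is added to every token of that reference, and that text tokens come first in the concatenation. Both choices, and all function names, are assumptions for illustration rather than details confirmed by the source. In a real model the concept embeddings would be learned parameters.

```python
def add_concept_embeddings(reference_tokens, concept_embeddings):
    """Tag each reference image's tokens with the embedding of its slot.

    reference_tokens[i] is a list of token vectors for the i-th reference
    image; concept_embeddings[i] is the vector for slot i (assumed to be
    added element-wise to every token of that reference).
    """
    if len(reference_tokens) > len(concept_embeddings):
        raise ValueError("more reference images than concept embeddings")
    tagged = []
    for index, tokens in enumerate(reference_tokens):
        offset = concept_embeddings[index]
        tagged.append(
            [[value + shift for value, shift in zip(token, offset)] for token in tokens]
        )
    return tagged


def build_cross_attention_context(text_tokens, reference_tokens, concept_embeddings):
    """Concatenate text tokens with order-tagged reference image tokens."""
    context = [list(token) for token in text_tokens]
    for tokens in add_concept_embeddings(reference_tokens, concept_embeddings):
        context.extend(tokens)
    return context


if __name__ == "__main__":
    text_tokens = [[1.0, 1.0]]                      # one text token, dim 2
    reference_tokens = [
        [[0.0, 0.0], [0.0, 0.0]],                   # reference 0: two tokens
        [[0.0, 0.0]],                               # reference 1: one token
    ]
    concept_embeddings = [[0.1, 0.2], [0.3, 0.4]]   # one vector per slot
    for token in build_cross_attention_context(
        text_tokens, reference_tokens, concept_embeddings
    ):
        print(token)
```

Because every token of a given reference carries the same slot-specific offset, the cross-attention layer can in principle tell which reference a token came from, which is what lets an anchor token in the prompt be matched to the right image.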