VideoBooth: Diffusion-based Video Generation with Image Prompts

Yuming Jiang, Tianxing Wu, Shuai Yang, Chenyang Si, Dahua Lin, Yu Qiao, Chen Change Loy, Ziwei Liu

TL;DR

VideoBooth tackles the challenge of image-prompt guided video generation by integrating a coarse visual embedding from a CLIP-based image encoder with a fine-grained, multi-scale attention-injection mechanism into a pretrained diffusion-based text-to-video model. The two-branch, coarse-to-fine design preserves the appearance specified by image prompts and ensures temporal consistency across frames without any inference-time finetuning. A dedicated WebVid-derived VideoBooth dataset supports training and evaluation, demonstrating superior image alignment while maintaining competitive text alignment, across both quantitative metrics and user studies. The work highlights practical, tuning-free video synthesis with clear pathways for future improvements and discusses societal implications and data handling considerations.

Abstract

Text-driven video generation has witnessed rapid progress. However, text prompts alone are not enough to depict a desired subject appearance that accurately aligns with users' intents, especially for customized content creation. In this paper, we study the task of video generation with image prompts, which provide more accurate and direct content control than text prompts. Specifically, we propose a feed-forward framework, VideoBooth, with two dedicated designs: 1) We propose to embed image prompts in a coarse-to-fine manner. Coarse visual embeddings from the image encoder provide high-level encodings of image prompts, while fine visual embeddings from the proposed attention injection module provide multi-scale and detailed encodings of image prompts. These two complementary embeddings can faithfully capture the desired appearance. 2) In the attention injection module at the fine level, multi-scale image prompts are fed into different cross-frame attention layers as additional keys and values. This extra spatial information refines the details in the first frame and is then propagated to the remaining frames, which maintains temporal consistency. Extensive experiments demonstrate that VideoBooth achieves state-of-the-art performance in generating customized high-quality videos with subjects specified in image prompts. Notably, VideoBooth is a generalizable framework in which a single model works for a wide range of image prompts with a feed-forward pass.
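
The second design in the abstract (image-prompt features appended as additional keys and values, refining the first frame and then propagating to the remaining frames) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `attend` and `inject_image_prompt` are hypothetical, the query/key/value projections are taken to be the identity, and the remaining frames are assumed to attend only to the updated first frame. It uses plain Python lists so that it runs without any dependencies.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of token vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def inject_image_prompt(frames, image_tokens):
    """Two-stage update sketched from the abstract (hypothetical helper).

    frames:       list of frames, each a list of token vectors
    image_tokens: token vectors from the image prompt's latent representation
    Assumption: identity projections stand in for learned Q/K/V projections.
    """
    first = frames[0]
    # Stage 1: the first frame attends to its own tokens plus the
    # image-prompt tokens, which act as additional keys and values.
    keys_values = first + image_tokens
    updated_first = attend(first, keys_values, keys_values)
    # Stage 2: the remaining frames attend to the updated first frame,
    # which carries the injected appearance details forward in time.
    updated = [updated_first]
    for frame in frames[1:]:
        updated.append(attend(frame, updated_first, updated_first))
    return updated
```

In a multi-scale setting, a function of this kind would be applied once per cross-frame attention layer, with `image_tokens` taken from the image prompt's latent at that layer's resolution.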

Paper Structure (18 sections, 6 equations, 9 figures, 2 tables)

This paper contains 18 sections, 6 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Videos synthesized by image prompts. Our VideoBooth generates videos with the subjects specified in the image prompts.
  • Figure 2: The Use of Image Prompts. We generate three video clips using different types of prompts: simple text prompt, long text prompt, and image prompt. We use the LLaVA model (Liu et al., 2023) to generate a text prompt describing the appearance of the image prompt. Using text prompts alone cannot fully capture the visual characteristics of the image prompt.
  • Figure 3: Overview of VideoBooth. VideoBooth generates videos by taking image prompts $I$ and text prompts $T$ as inputs. The image prompt is fed into the CLIP image encoder, followed by MLP layers. The obtained coarse visual embedding $f_I$ is then inserted into the text embeddings. The composed embeddings serve as the input for cross attention. The embedding extracted by the encoder provides a coarse encoding of the visual appearance of the image prompt. To further refine the details in the generated videos, at the fine level, we append the latent representation of the image prompt to the cross-frame attention as additional keys and values. Different cross-frame attention layers receive latent representations with different scales. The multi-scale features with spatial details refine the synthesized details.
  • Figure 4: Fine Visual Embedding Refinement. We propose to inject the latent representation of the image prompt (here we use the image for illustration purposes) directly into the cross-frame attention module. We first use the keys and values from the image prompt to update the values of the first frame. Then, the updated values of the first frame are used to update the remaining frames. Injecting the image prompt in the cross-frame attention helps to transfer the detailed visual characteristics of the image prompt to the synthesized frames. We perform the refinement in different cross-attention layers with different scales.
  • Figure 5: Qualitative Comparison. VideoBooth effectively preserves the fidelity of image prompts and achieves better visual quality.
  • ...and 4 more figures
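
The coarse branch described in the Figure 3 caption (CLIP image feature, then MLP layers, then insertion of $f_I$ into the text embeddings) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: the caption only says that $f_I$ is "inserted into the text embeddings", so replacing the embedding at a given subject token position is an assumption, as are the ReLU activation and the hypothetical names `mlp`, `compose_embeddings`, and `subject_index`. A plain vector stands in for the output of the CLIP image encoder.

```python
def linear(x, weight, bias):
    """Affine map: one output per (row, bias) pair."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def mlp(x, layers):
    """Apply (weight, bias) layers with ReLU between them (activation assumed)."""
    for i, (weight, bias) in enumerate(layers):
        x = linear(x, weight, bias)
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


def compose_embeddings(text_embeddings, image_feature, layers, subject_index):
    """Project the image feature to f_I and place it among the text embeddings.

    Assumption: f_I replaces the token embedding at `subject_index`; the
    caption does not specify how the insertion is performed.
    """
    f_i = mlp(image_feature, layers)
    if len(f_i) != len(text_embeddings[subject_index]):
        raise ValueError("projected image embedding must match the text embedding width")
    composed = [list(token) for token in text_embeddings]  # copy, do not mutate input
    composed[subject_index] = f_i
    return composed
```

The composed sequence would then serve as the keys and values for cross attention in place of the text-only embeddings, which is how the coarse visual information reaches the denoising network.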