VIMI: Grounding Video Generation through Multi-modal Instruction

Yuwei Fang, Willi Menapace, Aliaksandr Siarohin, Tsai-Shien Chen, Kuan-Chien Wang, Ivan Skorokhodov, Graham Neubig, Sergey Tulyakov

TL;DR

ViMi tackles the lack of visual grounding in text-only video diffusion by introducing retrieval-augmented multimodal pretraining and multimodal instruction tuning. It builds a large multimodal memory (500M image-text pairs) and uses BM25 to retrieve top documents per caption, forming multimodal inputs that condition a diffusion-based video generator via a Multimodal Large Language Model. A two-stage training process first establishes a grounded video generator and then fine-tunes on three tasks—subject-driven generation, video prediction, and text-to-video generation—through multimodal instructions, yielding improved grounding, identity preservation, and temporal coherence. The approach achieves competitive or state-of-the-art results on UCF101 and demonstrates strong zero-shot, multimodal-grounded video generation abilities with robust ablations highlighting the value of retrieval augmentation and instruction tuning.
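The retrieval step mentioned above (BM25 over a memory of image-text pairs, keeping the top documents per caption) can be illustrated with a minimal sketch. This is not the authors' implementation: the whitespace tokenizer, the `k1` and `b` defaults, the toy four-entry memory, the file names, and the helper `build_multimodal_prompt` are all assumptions made for illustration, and the real memory holds about 500M pairs.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the paper does not specify one.
    return text.lower().split()


class BM25:
    """Plain Okapi BM25 over a small in-memory list of captions."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.tf = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        # Document frequency: number of captions containing each term.
        df = Counter(t for d in self.docs for t in set(d))
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        tf, dl = self.tf[i], len(self.docs[i])
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=3):
        # Indices of the k highest-scoring captions, best first.
        order = sorted(range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True)
        return order[:k]


def build_multimodal_prompt(caption, memory, retriever, k=2):
    # Hypothetical helper: pair a caption with its retrieved (image, text) examples.
    hits = retriever.top_k(caption, k)
    return {"caption": caption, "context": [memory[i] for i in hits]}


# Toy memory of (image path, caption) pairs; paths are placeholders.
memory = [
    ("img_000.jpg", "a golden retriever running on the beach"),
    ("img_001.jpg", "a red sports car on a mountain road"),
    ("img_002.jpg", "a dog catching a frisbee in a park"),
    ("img_003.jpg", "a bowl of ramen on a wooden table"),
]
retriever = BM25([text for _, text in memory])
prompt = build_multimodal_prompt("a dog running along the beach at sunset", memory, retriever, k=2)
print(prompt["context"])
```

In this sketch the retrieved pairs are simply attached to the caption; in the framework summarized above, such multimodal inputs would be encoded by the multimodal large language model that conditions the video generator.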

Abstract

Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining. This limitation stems from the absence of large-scale multimodal prompt video datasets, resulting in a lack of visual grounding and restricting their versatility and application in multimodal integration. To address this, we construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts, and then use a two-stage training strategy to enable diverse video generation tasks within the same model. In the first stage, we propose a multimodal conditional video generation framework for pretraining on these augmented datasets, establishing a foundational model for grounded video generation. In the second stage, we finetune the model from the first stage on three video generation tasks, incorporating multimodal instructions. This process further refines the model's ability to handle diverse inputs and tasks, ensuring seamless integration of multimodal information. After this two-stage training process, VIMI demonstrates multimodal understanding capabilities, producing contextually rich and personalized videos grounded in the provided inputs, as shown in Figure 1. Compared to previous visually grounded video generation methods, VIMI can synthesize consistent and temporally coherent videos with large motion while retaining semantic control. Lastly, VIMI also achieves state-of-the-art text-to-video generation results on the UCF101 benchmark.

Paper Structure (31 sections, 5 equations, 7 figures, 1 table)


Figures (7)

  • Figure 1: Examples of ViMi for grounded video generation. Thanks to our visual grounding during retrieval-augmented pretraining and multimodal instruction tuning, our generator can generate videos from multimodal prompts that include multiple image entities. Each multimodal prompt is displayed below the generated videos, illustrating the model's capability to integrate and interpret both textual and visual inputs effectively.
  • Figure 2: Overview of our ViMi framework. (a-left) We first construct a large-scale dataset by employing retrieval methods to pair multimodal in-context examples with the given text prompts. Then we present a multimodal conditional video generation framework for pretraining on these augmented datasets. (b) We propose multimodal instruction tuning for video generation, grounding the model on customized inputs specified in different multimodal instructions, covering subject-driven video generation, video prediction, and text-to-video generation. By fine-tuning the model with multimodal instructions, we enable ViMi to generate videos that are both contextually rich and visually accurate across a wider range of tasks.
  • Figure 3: Comparison of subject-driven video generation. We compare with the concurrent work ID-Animator [he2024id] for zero-shot human video generation (above) and with VideoBooth [jiang2023videobooth] for general subject-driven video generation (below). Our video generator can synthesize temporally coherent videos with large motion while retaining semantic control.
  • Figure 4: An overview of our data curation pipeline for subject-driven video generation.
  • Figure 5: Examples of Video Prediction results.
  • ...and 2 more figures