Omni-Video 2: Scaling MLLM-Conditioned Diffusion for Unified Video Generation and Editing

Hao Yang, Zhiyu Tan, Jia Gong, Luozheng Qin, Hesen Chen, Xiaomeng Yang, Yuqing Sun, Yuetan Lin, Mengping Yang, Hao Li

TL;DR

Omni-Video 2 tackles unified video generation and editing by linking pretrained multimodal large language models with diffusion-based video priors through a lightweight, parameter-efficient conditioning scheme. A key contribution is the Editing Prompt Reasoner, which converts user prompts into explicit target captions. It is paired with a Multimodal Condition Adapter that injects cross-attention-based conditioning without altering the diffusion backbone. The model scales to a 14B diffusion backbone and is trained on a carefully curated, multi-task dataset to preserve the pretrained generative priors while enabling diverse editing tasks. Empirical results on FiVE and VBench show state-of-the-art instruction following in editing and strong generation quality, all achieved without task-specific architectural changes. The work offers a practical path toward scalable, unified video modeling with broad applicability and ready-to-use resources.

Abstract

We present Omni-Video 2, a scalable and computationally efficient model that connects pretrained multimodal large language models (MLLMs) with video diffusion models for unified video generation and editing. Our key idea is to exploit the understanding and reasoning capabilities of MLLMs to interpret user instructions and produce explicit target captions. In this way, the rich contextual representations from the understanding model are directly used to guide the generative process, thereby improving performance on complex and compositional editing. Moreover, a lightweight adapter is developed to inject multimodal conditional tokens into pretrained text-to-video diffusion models, allowing maximum reuse of their powerful generative priors in a parameter-efficient manner. Benefiting from these designs, we scale up Omni-Video 2 to a 14B video diffusion model on meticulously curated, high-quality training data, supporting high-quality text-to-video generation and various video editing tasks such as object removal, object addition, background change, and complex motion editing. We evaluate the performance of Omni-Video 2 on the FiVE benchmark for fine-grained video editing and the VBench benchmark for text-to-video generation. The results demonstrate its superior ability to follow complex compositional instructions in video editing, while also achieving competitive or superior quality in video generation tasks.
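
The adapter idea in the abstract, injecting conditional tokens into a pretrained diffusion model through cross-attention, can be illustrated with a toy sketch. This is not the paper's adapter: the function names, the scalar `gate`, the absence of learned projections, and the tiny two-dimensional tokens are all assumptions made for illustration. It only shows why a residual, gated cross-attention branch can leave the backbone's features untouched when the gate is zero.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, cond_tokens):
    """Single-head attention: each video token attends over the condition tokens.

    Assumption: no learned projections; condition tokens act as keys and values.
    """
    scale = 1.0 / math.sqrt(len(cond_tokens[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in cond_tokens])
        # Weighted average of the condition tokens, one value per feature dim.
        mixed = [sum(w * k[d] for w, k in zip(weights, cond_tokens))
                 for d in range(len(q))]
        out.append(mixed)
    return out

def adapter_block(video_tokens, cond_tokens, gate):
    """Residual injection: backbone features plus gated cross-attention output.

    `gate` is a hypothetical scalar standing in for a learned gating parameter.
    """
    attended = cross_attention(video_tokens, cond_tokens)
    return [[v + gate * a for v, a in zip(vt, at)]
            for vt, at in zip(video_tokens, attended)]

# Toy latent tokens from the (frozen) backbone and toy condition tokens.
video_tokens = [[1.0, 0.0], [0.0, 1.0]]
cond_tokens = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]

unchanged = adapter_block(video_tokens, cond_tokens, gate=0.0)
conditioned = adapter_block(video_tokens, cond_tokens, gate=0.1)
```

With `gate=0.0` the block returns the backbone tokens exactly, so the pretrained behavior is preserved; a non-zero gate mixes in information from the condition tokens. A real adapter would add learned query, key, and value projections and operate on batched tensors.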

Paper Structure (15 sections, 2 equations, 11 figures, 2 tables)

This paper contains 15 sections, 2 equations, 11 figures, and 2 tables.

Figures (11)

  • Figure 1: We present Omni-Video 2, a unified video model for video generation and editing. Our model supports various video editing tasks, including local object changes (e.g., removal, addition) and global changes (e.g., style, background), producing edits that are both faithful to the user's prompt and temporally coherent with the original video, even on high-motion videos.
  • Figure 2: Overall framework of Omni-Video 2 for unified video modeling. An MLLM-based editing prompt reasoner first interprets the user instructions in the context of the source video to produce a precise target caption. A lightweight adapter then injects the multimodal conditional guidance into a powerful, pretrained T2V diffusion model to perform editing. This design efficiently combines the MLLM's advanced reasoning with the T2V model's strong generative priors, enabling complex edits without costly full-model retraining.
  • Figure 3: Video editing data instruction categories.
  • Figure 4: Composition of the final training dataset.
  • Figure 5: Qualitative editing results for local object addition. Omni-Video 2 accurately adds new objects based on the editing instructions while preserving the temporal consistency of the original video. The generated objects are realistic and well-integrated into the scene.
  • ...and 6 more figures