UniVideo: Unified Understanding, Generation, and Editing for Videos

Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhu Chen

TL;DR

UniVideo addresses the absence of unified multimodal modeling in the video domain by coupling a multimodal large language model (MLLM) for instruction understanding with a generation backbone (MMDiT) in a dual-stream architecture. It unifies text-to-video, image-to-video, in-context generation, and in-context editing under a single multimodal instruction paradigm, trained in three stages to align semantic meaning with video synthesis. The model achieves competitive or superior performance across video understanding, generation, and editing benchmarks, and exhibits zero-shot generalization to unseen editing tasks and novel task compositions, including visual prompting. By supporting mask-free editing and translating visual prompts into in-context generation, UniVideo demonstrates the practical advantages of unified multimodal modeling for flexible and scalable video manipulation.

Abstract

Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.
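
The dual-stream data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `toy_mllm`, `toy_mmdit_step`, the `connector` projection, and the dimensions are all hypothetical stand-ins chosen for illustration, and the attention readout is a simplification of what a real MMDiT block does. It only shows the shape of the idea: an understanding stream turns the instruction into hidden states, which are projected into the generation stream's space and used to condition the video latents.

```python
import math
import random

random.seed(0)

MLLM_DIM = 8  # hypothetical hidden size of the understanding stream
DIT_DIM = 4   # hypothetical hidden size of the generation stream


def rand_matrix(rows, cols):
    """Random projection weights (stand-in for a learned connector)."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(vec, mat):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(v * mat[i][j] for i, v in enumerate(vec)) for j in range(len(mat[0]))]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def toy_mllm(tokens):
    """Stand-in for the MLLM: one deterministic pseudo-embedding per token."""
    hidden = []
    for tok in tokens:
        rng = random.Random(tok)  # seeded by the token string
        hidden.append([rng.uniform(-1, 1) for _ in range(MLLM_DIM)])
    return hidden


def toy_mmdit_step(latents, cond):
    """Stand-in for one generation update: each latent attends over the conditioning."""
    updated = []
    for z in latents:
        scores = [sum(a * b for a, b in zip(z, c)) for c in cond]
        weights = softmax(scores)
        context = [sum(w * c[j] for w, c in zip(weights, cond)) for j in range(DIT_DIM)]
        updated.append([zi + ci for zi, ci in zip(z, context)])
    return updated


connector = rand_matrix(MLLM_DIM, DIT_DIM)


def generate(tokens, num_latents=3):
    hidden = toy_mllm(tokens)                      # understanding stream
    cond = [matvec(h, connector) for h in hidden]  # project into generation space
    latents = [[random.gauss(0, 1) for _ in range(DIT_DIM)] for _ in range(num_latents)]
    return toy_mmdit_step(latents, cond)           # generation stream


video_latents = generate(["make", "the", "car", "red"])
print(len(video_latents), len(video_latents[0]))
```

In this sketch the instruction only reaches the generation stream through the projected hidden states, which mirrors the division of labor the abstract describes: interpretation happens in one stream and synthesis in the other.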

Paper Structure

This paper contains 29 sections, 13 figures, 8 tables.

Figures (13)

  • Figure 1: UniVideo is a unified system that can understand multimodal instructions and generate video content. More videos are available on the project website.
  • Figure 2: Model architecture. UniVideo is a dual-stream model consisting of an MLLM for understanding and an MMDiT module for generation. While prior work, such as Qwen-Image and OmniGen2, explores a similar idea in the image domain, our model generalizes this design to video.
  • Figure 3: UniVideo leverages the MLLM stream to understand and interpret user intent from complex multimodal prompts that cannot be handled by the DiT alone. For example, users can provide diagrams or visual annotations to guide video generation without writing dense textual prompts.
  • Figure 4: Qualitative comparison of UniVideo with state-of-the-art task-specific experts on in-context generation and in-context editing tasks.
  • Figure 5: Zero-shot generalization. We demonstrate two types of generalization. (i) UniVideo was not trained on general free-form video editing data. It transfers this ability from diverse image editing data to the video domain through joint training with in-context video generation and editing data (limited to ID deletion, swapping, addition, and stylization), enabling it to handle previously unseen video editing instructions. (ii) UniVideo also generalizes to novel task compositions, even though it was not explicitly trained on them.
  • ...and 8 more figures