Omni-Video: Democratizing Unified Video Understanding and Generation

Zhiyu Tan, Hao Yang, Luozheng Qin, Jia Gong, Mengping Yang, Hao Li

TL;DR

Omni-Video addresses the need for a unified framework that handles video understanding, generation, and editing within a single model. It introduces a lightweight two-head MLLM and an adapter that connects it to diffusion-based text-to-video decoders, enabling continuous visual token generation and conditioning. Training follows a three-stage, resource-efficient strategy with multi-task data, plus a think mode that improves instruction understanding. Experiments show competitive performance on text-to-image/video generation, long-range video generation, and editing, with the think mode providing consistent gains. The work offers a practical pathway to scalable unified video modeling with strong generalization across tasks.

Abstract

Notable breakthroughs in unified understanding and generation modeling have led to remarkable advancements in image understanding, reasoning, production and editing, yet current foundational models predominantly focus on processing images, creating a gap in the development of unified models for video understanding and generation. This report presents Omni-Video, an efficient and effective unified framework for video understanding, generation, and instruction-based editing. Our key insight is to teach existing multimodal large language models (MLLMs) to produce continuous visual clues that are used as the input of diffusion decoders, which produce high-quality videos conditioned on these visual clues. To fully unlock the potential of our system for unified video modeling, we integrate several technical improvements: 1) a lightweight architectural design that attaches a vision head on top of the MLLM and an adapter before the input of the diffusion decoder, where the vision head produces visual tokens and the adapter maps these tokens into the conditional space of the diffusion decoder; and 2) an efficient multi-stage training scheme that facilitates a fast connection between MLLMs and diffusion decoders with limited data and computational resources. We empirically demonstrate that our model exhibits satisfactory generalization abilities across video generation, editing and understanding tasks.
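
The data flow described in the abstract (MLLM hidden states, then a vision head, then an adapter, then the conditioning input of a diffusion decoder) can be pictured with a minimal shape-level sketch. This is an illustration only, not the authors' implementation: the single linear maps, the dimension sizes, and every name below (`vision_head`, `adapter`, `condition_from_hidden`, `HIDDEN_DIM`, and so on) are assumptions made for the example, and the real modules are presumably more elaborate.

```python
import random


def make_linear(in_dim, out_dim, seed):
    """Return a random (out_dim x in_dim) weight matrix as nested lists."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weight, tokens):
    """Apply the same linear map to every token (each token is a list of floats)."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in weight] for tok in tokens]


# Hypothetical sizes, chosen only for illustration.
HIDDEN_DIM = 8   # width of the MLLM hidden states
VISUAL_DIM = 6   # width of the continuous visual tokens
COND_DIM = 4     # width of the diffusion decoder's conditioning space
NUM_TOKENS = 3   # number of visual tokens produced for one request

# Stand-ins for the two lightweight modules (assumed to be single linear layers here).
vision_head = make_linear(HIDDEN_DIM, VISUAL_DIM, seed=0)
adapter = make_linear(VISUAL_DIM, COND_DIM, seed=1)


def condition_from_hidden(hidden_states):
    """Map MLLM hidden states to decoder conditioning via vision head and adapter."""
    visual_tokens = apply_linear(vision_head, hidden_states)  # continuous visual tokens
    return apply_linear(adapter, visual_tokens)               # decoder-space conditioning


# Random vectors stand in for real MLLM hidden states.
rng = random.Random(42)
hidden_states = [[rng.gauss(0, 1) for _ in range(HIDDEN_DIM)] for _ in range(NUM_TOKENS)]
condition = condition_from_hidden(hidden_states)
print(len(condition), len(condition[0]))  # number of tokens, conditioning width
```

The sketch only shows that the two added modules sit between an otherwise unchanged MLLM and diffusion decoder; how they are trained across the stages is not modeled here.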

Paper Structure

This paper contains 34 sections, 4 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Comprehensive illustration of the fundamental capabilities of Omni-Video on (a) photorealistic text-to-image/video generation, (b) image/video editing (i.e., changing the style of the source image toward that of a target image and removing objects from the source image), and (c) video understanding (describing directions, background, motion, camera movement, etc.). All of these tasks can be accomplished within a single unified architecture.
  • Figure 2: Overall architecture of our proposed Omni-Video for unified video understanding and generation. We connect MLLMs' exceptional understanding capability with the visual generation ability of diffusion decoders through a lightweight architectural design, enabling MLLMs to produce continuous visual tokens that are decoded into photorealistic images/videos by the diffusion decoder.
  • Figure 3: Distribution of datasets used in the joint vision and language pretraining stage.
  • Figure 4: Illustration of our multi-stage training strategy. The flames denote that the module parameters are trainable at each stage, and the block indicates that the model parameters are frozen.
  • Figure 5: Example images generated by our proposed Omni-Video on text-to-image settings.
  • ...and 5 more figures