UniVid: The Open-Source Unified Video Model

Jiabin Luo, Junhui Lin, Zeyu Zhang, Biao Wu, Meng Fang, Ling Chen, Hao Tang

TL;DR

UniVid presents an open-source unified video model that jointly handles video understanding and generation by coupling a multimodal large language model with a diffusion-based video decoder via a lightweight adapter. It introduces Temperature Modality Alignment to preserve semantic faithfulness during early diffusion steps and Pyramid Reflection for efficient, query-driven temporal reasoning. Through a three-stage training pipeline and extensive ablations, UniVid achieves state-of-the-art or competitive results on VBench-Long and multiple video QA benchmarks with a 7B-scale backbone, while requiring only modest fine-tuning data. The approach demonstrates the practicality and effectiveness of unified video intelligence, offering an efficient path toward combined reasoning and high-fidelity generation in open-source form.

Abstract

Unified video modeling that combines generation and understanding capabilities is increasingly important but faces two key challenges: maintaining semantic faithfulness during flow-based generation due to text-visual token imbalance and the limitations of uniform cross-modal attention across the flow trajectory, and efficiently extending image-centric MLLMs to video without costly retraining. We present UniVid, a unified architecture that couples an MLLM with a diffusion decoder through a lightweight adapter, enabling both video understanding and generation. We introduce Temperature Modality Alignment to improve prompt adherence and Pyramid Reflection for efficient temporal reasoning via dynamic keyframe selection. Extensive experiments on standard benchmarks demonstrate state-of-the-art performance, achieving a 2.2% improvement on VBench-Long total score compared to EasyAnimateV5.1, and 1.0% and 3.3% accuracy gains on MSVD-QA and ActivityNet-QA, respectively, compared with the best prior 7B baselines. Code: https://github.com/AIGeeksGroup/UniVid. Website: https://aigeeksgroup.github.io/UniVid.

Paper Structure

This paper contains 36 sections, 16 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: We present UniVid, an open-source unified video model for both understanding and generation tasks. Our model requires only a small amount of high-quality data for fine-tuning and achieves competitive results across various tasks.
  • Figure 2: Overall architecture of our proposed UniVid for unified video understanding and generation. UniVid couples an autoregressive MLLM with a DiT-based diffusion decoder. The MLLM's outputs pass through a lightweight adapter to interface with the Wan backbone, forming the generation branch, and simultaneously pass through the Pyramid Reflection module to connect with the LLM, forming the understanding branch.
  • Figure 3: Comparisons with state-of-the-art video generation models (Wan, MiniMax, HunyuanVideo, EasyAnimate).
  • Figure 4: Comparisons with state-of-the-art video understanding models (Video-LLaVA, SF-LLaVA-7B).
  • Figure 5: Qualitative results for video understanding. Blue marks static questions; green marks dynamic questions.
  • ...and 3 more figures