UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation

Jiehui Huang, Yuechen Zhang, Xu He, Yuan Gao, Zhi Cen, Bin Xia, Yan Zhou, Xin Tao, Pengfei Wan, Jiaya Jia

TL;DR

UnityVideo tackles the limitation of single-modality conditioning in video generation by unifying multiple visual modalities (depth, optical flow, segmentation, skeleton, DensePose) and training paradigms within a diffusion-transformer framework. It introduces a dynamic noise scheduling strategy, a modality-adaptive switcher, and an in-context learner to enable plug-and-play processing and cross-modal reasoning, backed by large-scale OpenUni data and UniBench evaluation. The approach yields faster convergence, stronger zero-shot generalization, and improved alignment with physical world constraints across text-to-video, controllable generation, and modality estimation. Together with OpenUni and UniBench, UnityVideo demonstrates robust, scalable unified multimodal world modeling for future video-generation and perception systems.

Abstract

Recent video generation models demonstrate impressive synthesis capabilities but remain limited by single-modality conditioning, constraining their holistic world understanding. This stems from insufficient cross-modal interaction and limited modal diversity for comprehensive world knowledge representation. To address these limitations, we introduce UnityVideo, a unified framework for world-aware video generation that jointly learns across multiple modalities (segmentation masks, human skeletons, DensePose, optical flow, and depth maps) and training paradigms. Our approach features two core components: (1) dynamic noising to unify heterogeneous training paradigms, and (2) a modality switcher with an in-context learner that enables unified processing via modular parameters and contextual learning. We contribute a large-scale unified dataset with 1.3M samples. Through joint optimization, UnityVideo accelerates convergence and significantly enhances zero-shot generalization to unseen data. We demonstrate that UnityVideo achieves superior video quality, consistency, and improved alignment with physical world constraints. Code and data can be found at: https://github.com/dvlab-research/UnityVideo
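The "dynamic noising" idea can be illustrated with a small sketch. The three paradigms (text-to-video, controllable generation, modality estimation) are named in the summary above; how each one maps to per-stream noise levels is an assumption made here for illustration, as are the function names `sample_noise_levels` and `add_noise` and the linear noise interpolation. This is not the authors' implementation.

```python
import random

# Task names follow the paradigms listed in the summary; the mapping from
# task to noise levels below is an ASSUMPTION, not taken from the paper.
TASKS = ("text_to_video", "controllable_generation", "modality_estimation")


def sample_noise_levels(task, rng):
    """Return (t_rgb, t_mod): noise levels in [0, 1] for RGB and auxiliary tokens.

    A level of 0.0 keeps the stream clean so it acts as a condition;
    a positive level means the stream is noised and must be denoised.
    """
    t = rng.uniform(1e-3, 1.0)  # diffusion time for the noised stream(s)
    if task == "text_to_video":
        return t, t      # both streams are noised and generated jointly
    if task == "controllable_generation":
        return t, 0.0    # clean auxiliary modality conditions RGB generation
    if task == "modality_estimation":
        return 0.0, t    # clean RGB conditions estimation of the modality
    raise ValueError(f"unknown task: {task}")


def add_noise(x, t, rng):
    """Blend data with Gaussian noise (hypothetical linear schedule)."""
    return [(1.0 - t) * v + t * rng.gauss(0.0, 1.0) for v in x]


if __name__ == "__main__":
    rng = random.Random(0)
    task = rng.choice(TASKS)  # a different paradigm can be drawn at every step
    t_rgb, t_mod = sample_noise_levels(task, rng)
    rgb_tokens = add_noise([0.5, -0.2, 0.1], t_rgb, rng)
    mod_tokens = add_noise([1.0, 0.0, -1.0], t_mod, rng)
    print(task, t_rgb, t_mod)
```

Under this reading, a single denoising network sees all three paradigms during training, because switching tasks only changes which token stream receives noise.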

Paper Structure

This paper contains 42 sections, 1 equation, 11 figures, 9 tables.

Figures (11)

  • Figure 1: Evolution of attention patterns in UnityVideo. Analysis of attention maps shows that interactions between RGB and auxiliary modalities strengthen progressively across layers. Meanwhile, the model’s text-following behavior and spatial reasoning capabilities also improve, reflecting more coherent cross-modal integration.
  • Figure 2: Training on unified modalities benefits video generation. Unified multi-modal and multi-task joint training achieves the lowest final loss on RGB video generation, outperforming single-modality joint training and RGB finetuning baseline.
  • Figure 2: Comparison of physical understanding. UnityVideo demonstrates stronger physical reasoning and improved text alignment compared with current state-of-the-art video generation models.
  • Figure 3: Overview of UnityVideo. UnityVideo achieves task unification through a dynamic noise injection strategy applied to input tokens (left), and modality unification via the proposed Modality-Aware AdaLN Table (center). Specifically, $L_r$ and $L_m$ denote the learnable parameter tables for the RGB modality and auxiliary video-related modalities (e.g., depth, optical flow, DensePose, skeleton), respectively. $C_{r}$ and $C_{m}$ represent the prompt condition for RGB video content and the in-context modality learning prompt, respectively, while $V_r$ and $V_m$ correspond to the token sequences from the RGB and auxiliary modalities, respectively.
  • Figure 3: UniBench consists of two complementary components: (i) high-fidelity Unreal Engine depth data for evaluating depth estimation, and (ii) diverse real-world videos with rich multimodal annotations for assessing video generation quality.
  • ...and 6 more figures
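
The Modality-Aware AdaLN Table mentioned in the Figure 3 caption can be pictured with a rough sketch. The caption states only that learnable parameter tables exist for the RGB and auxiliary modalities. Everything else here is an assumption for illustration: one (scale, shift) row per modality, the additive combination with a base modulation, and the names `ModalityAdaLNTable` and `modulate`.

```python
import math

# Modality names come from the abstract; one table row per modality is an
# ASSUMPTION about how the learnable tables might be organized.
MODALITIES = ("rgb", "depth", "optical_flow", "densepose", "skeleton", "segmentation")


def layer_norm(x, eps=1e-6):
    """Normalize a token vector to zero mean and unit variance (no affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


class ModalityAdaLNTable:
    """Hypothetical lookup table of per-modality AdaLN offsets."""

    def __init__(self, dim):
        # Zero initialization: every modality starts from the shared
        # base modulation and can diverge as the rows are trained.
        self.table = {m: ([0.0] * dim, [0.0] * dim) for m in MODALITIES}

    def modulate(self, x, modality, base_scale, base_shift):
        """Apply LayerNorm, then scale and shift using base plus modality row."""
        scale_m, shift_m = self.table[modality]
        h = layer_norm(x)
        return [
            hv * (1.0 + bs + sm) + bsh + shm
            for hv, bs, sm, bsh, shm in zip(h, base_scale, scale_m, base_shift, shift_m)
        ]


if __name__ == "__main__":
    adaln = ModalityAdaLNTable(dim=4)
    tokens = [1.0, 2.0, 3.0, 4.0]
    zeros = [0.0] * 4
    print(adaln.modulate(tokens, "depth", zeros, zeros))
```

In this sketch, switching modality only changes which table row is read, so the transformer weights stay shared across modalities.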