FullDiT: Multi-Task Video Generative Foundation Model with Full Attention

Xuan Ju, Weicai Ye, Quande Liu, Qiulin Wang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Qiang Xu

TL;DR

FullDiT addresses fine-grained, multi-condition video generation by unifying diverse input signals into a single sequence and processing them with full self-attention. The model tokenizes text, camera, identities, and depth into modality-specific sequences, concatenates them, and uses 2D and 3D self-attention with RoPE to capture spatiotemporal interactions, aided by AdaLN-Zero conditioning. The model is trained with a progressive strategy and evaluated on the new FullBench benchmark for multi-task generation, where it shows state-of-the-art results and an emergent ability to compose unseen condition combinations. The unified design reduces parameter overhead, avoids branch conflicts, and scales to multi-task video generation, with practical implications for multimedia production and AI-assisted content creation.

Abstract

Current video generative foundation models primarily focus on text-to-video tasks, providing limited control for fine-grained video content creation. Although adapter-based approaches (e.g., ControlNet) enable additional controls with minimal fine-tuning, they encounter challenges when integrating multiple conditions, including branch conflicts between independently trained adapters, parameter redundancy leading to increased computational cost, and suboptimal performance compared to full fine-tuning. To address these challenges, we introduce FullDiT, a unified foundation model for video generation that seamlessly integrates multiple conditions via unified full-attention mechanisms. By fusing multi-task conditions into a unified sequence representation and leveraging the long-context learning ability of full self-attention to capture condition dynamics, FullDiT reduces parameter overhead, avoids condition conflicts, and shows scalability and emergent ability. We further introduce FullBench for multi-task video generation evaluation. Experiments demonstrate that FullDiT achieves state-of-the-art results, highlighting the efficacy of full attention in complex multi-task video generation.
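
The core idea in the abstract, fusing several conditions into one sequence and letting full self-attention relate every token to every other token, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the hidden size, the per-modality token counts, the random projection weights, and the single-head attention are all assumptions chosen for illustration, and text is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # hypothetical hidden size

# Hypothetical, already-tokenized inputs: (num_tokens, d_model) per modality.
tokens = {
    "camera": rng.normal(size=(4, d_model)),     # e.g. one token per frame
    "identity": rng.normal(size=(6, d_model)),   # e.g. image patches
    "depth": rng.normal(size=(12, d_model)),     # e.g. spatiotemporal patches
    "video": rng.normal(size=(12, d_model)),     # noisy video latents
}

def full_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with no mask: every token sees every token."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Concatenate all modality sequences into one longer sequence.
sequence = np.concatenate(list(tokens.values()), axis=0)

# Random projections stand in for learned weights (assumption).
w_q, w_k, w_v = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(3)
)
out, weights = full_self_attention(sequence, w_q, w_k, w_v)

print(sequence.shape)  # one unified sequence of 4 + 6 + 12 + 12 tokens
print(weights.shape)   # square matrix: attention across all modalities
```

Because the attention matrix spans the whole concatenated sequence, a single set of attention weights handles every condition, which is the contrast with per-condition adapter branches that the abstract draws.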

Paper Structure

This paper contains 16 sections, 3 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: FullDiT is a multi-task video generative foundation model that unifies conditional learning with full self-attention. With self-attention’s long-context learning ability, FullDiT can flexibly take different combinations of input to generate high-quality videos.
  • Figure 2: Overview of the FullDiT architecture and comparison with adapter-based models. The diffusion process of the multi-task video generative model is shown on the left. For research purposes, this paper uses input conditions consisting of temporal-only cameras, spatial-only identities, and temporal-spatial depth video. Additional conditions can be incorporated into this model architecture for broader applications. As shown in (a), FullDiT unifies the various inputs in three steps: (1) patchify and tokenize each input condition into a unified sequence representation, (2) concatenate all sequences into a single longer one, and (3) learn condition dynamics with full self-attention. By comparison, earlier adapter-based approaches (shown in (b)) use distinct adapter designs that operate independently to process the various inputs, leading to branch conflicts, parameter redundancy, and suboptimal performance. Each block's subscript indicates its layer index.
  • Figure 3: Illustration of the condition training order. We use red to indicate the training data volume. M is for million.
  • Figure 4: Examples of two types of identity images.
  • Figure 5: Qualitative comparison of FullDiT and previous single-control video generation methods. We present identity-to-video results compared with ConceptMaster, depth-to-video results compared with Ctrl-Adapter and ControlVideo, and camera-to-video results compared with MotionCtrl, CamI2V, and CameraCtrl. Results denoted with * are image-to-video methods.
  • ...and 2 more figures
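
The Figure 2 caption distinguishes temporal-only, spatial-only, and temporal-spatial conditions that are all patchified into one sequence format. The sketch below shows one way such inputs could be turned into token sequences of a shared width. It is a simplified illustration on raw arrays, not the paper's pipeline: the helper names `patchify_video` and `project`, the patch size, the 12-value per-frame camera vector, and the random projections are all hypothetical.

```python
import numpy as np

def patchify_video(video, patch=2):
    """Split a (T, H, W, C) array into non-overlapping spatial patches.

    Returns (T * H/patch * W/patch, patch * patch * C) flattened patch tokens.
    """
    t, h, w, c = video.shape
    x = video.reshape(t, h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # group the two patch axes together
    return x.reshape(t * (h // patch) * (w // patch), patch * patch * c)

def project(tokens, d_model, rng):
    """Map tokens to a shared width; random weights stand in for learned ones."""
    w = rng.normal(size=(tokens.shape[-1], d_model)) / np.sqrt(tokens.shape[-1])
    return tokens @ w

rng = np.random.default_rng(0)
d_model = 32  # hypothetical shared token width

camera = rng.normal(size=(4, 12))         # temporal-only: one vector per frame
identity = rng.normal(size=(1, 8, 8, 3))  # spatial-only: one image as a 1-frame clip
depth = rng.normal(size=(4, 8, 8, 3))     # temporal-spatial: a short depth video

camera_tokens = project(camera, d_model, rng)
identity_tokens = project(patchify_video(identity), d_model, rng)
depth_tokens = project(patchify_video(depth), d_model, rng)

# Step (2) of the caption: concatenate everything into one longer sequence.
unified = np.concatenate([camera_tokens, identity_tokens, depth_tokens], axis=0)
print(camera_tokens.shape, identity_tokens.shape, depth_tokens.shape, unified.shape)
```

Once every condition shares the same token width, adding a new condition type only lengthens the sequence rather than adding a separate branch, which is the extensibility the caption points to.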