FullDiT: Multi-Task Video Generative Foundation Model with Full Attention

Xuan Ju; Weicai Ye; Quande Liu; Qiulin Wang; Xintao Wang; Pengfei Wan; Di Zhang; Kun Gai; Qiang Xu

TL;DR

FullDiT tackles the need for fine-grained, multi-condition video generation by unifying diverse input signals into a single sequence and processing them with full self-attention. The model tokenizes text, camera, identities, and depth into modality-specific sequences, concatenates them, and uses 2D and 3D self-attention with RoPE to capture spatiotemporal interactions, aided by AdaLN-Zero conditioning. A progressive training strategy and the FullBench benchmark enable robust evaluation of multi-task generation, showing state-of-the-art results and emergent abilities to compose unseen condition combinations. These advances reduce parameter overhead, avoid branch conflicts, and enable scalable multi-task video generation with practical implications for multimedia production and AI-assisted content creation.
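The unified-sequence idea described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The token counts, the embedding width `DIM`, the random stand-in tokenizer `make_tokens`, and the identity query/key/value projections are all assumptions made for illustration. RoPE, AdaLN-Zero, and the 2D/3D attention split are omitted. The sketch only shows that once every condition is concatenated with the video latents into one sequence, an unmasked self-attention pass lets every token attend to every other token, with no per-condition adapter branch.

```python
import math
import random


def make_tokens(n_tokens, dim, rng):
    """Hypothetical stand-in for a modality-specific tokenizer."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_tokens)]


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def full_self_attention(seq):
    """Single-head, unmasked self-attention.

    Assumption: Q, K, and V are the raw tokens (identity projections),
    which keeps the sketch short; a real model learns these projections.
    """
    dim = len(seq[0])
    scale = 1.0 / math.sqrt(dim)
    weights = []
    for query in seq:
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in seq]
        weights.append(softmax(scores))
    output = [
        [sum(w * value[d] for w, value in zip(row, seq)) for d in range(dim)]
        for row in weights
    ]
    return output, weights


rng = random.Random(0)
DIM = 8  # assumed embedding width, chosen only for illustration

# One token sequence per condition (token counts are arbitrary).
conditions = {
    "text": make_tokens(4, DIM, rng),
    "camera": make_tokens(3, DIM, rng),
    "identity": make_tokens(2, DIM, rng),
    "depth": make_tokens(5, DIM, rng),
}
video_latents = make_tokens(6, DIM, rng)

# Concatenate all conditions and the video latents into one sequence,
# remembering where each modality lives.
sequence = []
spans = {}
for name, tokens in list(conditions.items()) + [("video", video_latents)]:
    spans[name] = (len(sequence), len(sequence) + len(tokens))
    sequence.extend(tokens)

output, attn = full_self_attention(sequence)
```

In this sketch, adding a new condition only lengthens `sequence`. No additional attention module is introduced, which is the intuition behind the reduced parameter overhead claimed above.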

Abstract

Current video generative foundation models primarily focus on text-to-video tasks, providing limited control for fine-grained video content creation. Although adapter-based approaches (e.g., ControlNet) enable additional controls with minimal fine-tuning, they encounter challenges when integrating multiple conditions, including branch conflicts between independently trained adapters, parameter redundancy that increases computational cost, and suboptimal performance compared to full fine-tuning. To address these challenges, we introduce FullDiT, a unified foundation model for video generation that integrates multiple conditions through a single full-attention mechanism. By fusing multi-task conditions into a unified sequence representation and leveraging the long-context learning ability of full self-attention to capture condition dynamics, FullDiT reduces parameter overhead, avoids condition conflicts, and exhibits scalability and emergent abilities. We further introduce FullBench for multi-task video generation evaluation. Experiments demonstrate that FullDiT achieves state-of-the-art results, highlighting the efficacy of full attention in complex multi-task video generation.
