Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation

Tianhao Qi; Jianlong Yuan; Wanquan Feng; Shancheng Fang; Jiawei Liu; SiYu Zhou; Qian He; Hongtao Xie; Yongdong Zhang

TL;DR

Mask$^2$DiT introduces a dual-mask diffusion transformer for multi-scene long video generation, achieving fine-grained one-to-one alignment between text prompts and video segments while preserving inter-scene coherence. A symmetric binary attention mask enforces segment-specific text attention, and a segment-level conditional mask enables auto-regressive extension beyond a fixed number of scenes. The model is trained in two stages (pre-training on concatenated single-scene clips and supervised fine-tuning with segment-level conditioning), and it outperforms state-of-the-art methods in visual and semantic consistency, with strong temporal coherence and efficient computation. The approach broadens practical capabilities for multi-scene video creation and paves the way for longer, semantically aligned video content generation.

Abstract

Sora has unveiled the immense potential of the Diffusion Transformer (DiT) architecture in single-scene video generation. However, the more challenging task of multi-scene video generation, which offers broader applications, remains relatively underexplored. To bridge this gap, we propose Mask$^2$DiT, a novel approach that establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations. Specifically, we introduce a symmetric binary mask at each attention layer within the DiT architecture, ensuring that each text annotation applies exclusively to its respective video segment while preserving temporal coherence across visual tokens. This attention mechanism enables precise segment-level textual-to-visual alignment, allowing the DiT architecture to effectively handle video generation tasks with a fixed number of scenes. To further equip the DiT architecture with the ability to generate additional scenes based on existing ones, we incorporate a segment-level conditional mask, which conditions each newly generated segment on the preceding video segments, thereby enabling auto-regressive scene extension. Both qualitative and quantitative experiments confirm that Mask$^2$DiT excels in maintaining visual consistency across segments while ensuring semantic alignment between each segment and its corresponding text description. Our project page is https://tianhao-qi.github.io/Mask2DiTProject.
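
The two masks described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a token layout in which all text tokens come first, followed by all video tokens, both grouped by scene, and it assumes that text tokens from different scenes do not attend to each other. The function names `build_symmetric_mask` and `build_condition_mask` are hypothetical, as is the reading of the conditional mask as a per-token flag marking the preceding segments as given context.

```python
import numpy as np

def build_symmetric_mask(text_lens, video_lens):
    """Boolean attention mask (True = attention allowed).

    Assumed layout: [text_0 .. text_{n-1}, video_0 .. video_{n-1}].
    Rules taken from the abstract's description:
      - text tokens of scene i pair only with video tokens of scene i;
      - all video tokens attend to each other (temporal coherence).
    Assumption of this sketch: text tokens attend only within their own scene.
    """
    assert len(text_lens) == len(video_lens)
    n = len(text_lens)
    # Scene index of every token, text block first, then video block.
    text_seg = np.repeat(np.arange(n), text_lens)
    video_seg = np.repeat(np.arange(n), video_lens)
    seg = np.concatenate([text_seg, video_seg])
    is_video = np.concatenate([
        np.zeros(len(text_seg), dtype=bool),
        np.ones(len(video_seg), dtype=bool),
    ])
    same_scene = seg[:, None] == seg[None, :]
    both_video = is_video[:, None] & is_video[None, :]
    # Both terms are symmetric, so the resulting mask is symmetric too.
    return same_scene | both_video

def build_condition_mask(video_lens, num_context):
    """Per-video-token flag: True for tokens of the first `num_context`
    segments (treated as given context), False for segments to generate."""
    flags = np.arange(len(video_lens)) < num_context
    return np.repeat(flags, video_lens)

if __name__ == "__main__":
    mask = build_symmetric_mask(text_lens=[2, 2], video_lens=[3, 3])
    print(mask.astype(int))
    print(build_condition_mask([3, 3], num_context=1))
```

In an attention layer, a mask like this would typically be applied by setting the disallowed positions to a large negative value before the softmax. For auto-regressive extension, the condition flags would mark which segments are held fixed while the next one is generated.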

Paper Structure

Table of Contents

Figures (17)