Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation

Tianhao Qi, Jianlong Yuan, Wanquan Feng, Shancheng Fang, Jiawei Liu, SiYu Zhou, Qian He, Hongtao Xie, Yongdong Zhang

TL;DR

Mask$^2$DiT introduces a dual-mask diffusion transformer for multi-scene long video generation, achieving fine-grained one-to-one alignment between text prompts and video segments while preserving inter-scene coherence. A symmetric binary attention mask enforces segment-specific text attention, and a segment-level conditional mask enables auto-regressive extension beyond a fixed number of scenes. The model is trained in two stages (pre-training on concatenated single-scene clips and supervised fine-tuning with segment-level conditioning), and it outperforms state-of-the-art methods in visual and semantic consistency, with strong temporal coherence and efficient computation. The approach broadens practical capabilities for multi-scene video creation and paves the way for longer, semantically aligned video content generation.

Abstract

Sora has unveiled the immense potential of the Diffusion Transformer (DiT) architecture in single-scene video generation. However, the more challenging task of multi-scene video generation, which offers broader applications, remains relatively underexplored. To bridge this gap, we propose Mask$^2$DiT, a novel approach that establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations. Specifically, we introduce a symmetric binary mask at each attention layer within the DiT architecture, ensuring that each text annotation applies exclusively to its respective video segment while preserving temporal coherence across visual tokens. This attention mechanism enables precise segment-level textual-to-visual alignment, allowing the DiT architecture to effectively handle video generation tasks with a fixed number of scenes. To further equip the DiT architecture with the ability to generate additional scenes based on existing ones, we incorporate a segment-level conditional mask, which conditions each newly generated segment on the preceding video segments, thereby enabling auto-regressive scene extension. Both qualitative and quantitative experiments confirm that Mask$^2$DiT excels in maintaining visual consistency across segments while ensuring semantic alignment between each segment and its corresponding text description. Our project page is https://tianhao-qi.github.io/Mask2DiTProject.
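The segment-level conditional mask mentioned in the abstract can be pictured with a small sketch. The abstract only states that a newly generated segment is conditioned on the preceding ones, so everything below is an assumption made for illustration: the helper names `apply_segment_condition` and `masked_mse`, the additive noising rule, the equal segment lengths, and the choice to compute the loss only on the segment being generated. It is not the authors' implementation.

```python
import numpy as np


def apply_segment_condition(clean, noise, sigma, n_scenes, n_condition):
    """Keep the first `n_condition` segments clean and noise the rest.

    clean, noise: (T, d) arrays of video latents for `n_scenes` segments
    of equal length (an assumption of this sketch).
    Returns the model input and a (T, 1) boolean mask that is True on
    conditioning tokens.
    """
    assert clean.shape[0] % n_scenes == 0, "segments must have equal length"
    tokens_per_scene = clean.shape[0] // n_scenes
    # Segment-level decision, repeated for every token in the segment.
    cond = np.repeat(np.arange(n_scenes) < n_condition, tokens_per_scene)[:, None]
    noisy = clean + sigma * noise  # hypothetical noising rule, not from the paper
    model_input = np.where(cond, clean, noisy)
    return model_input, cond


def masked_mse(pred, target, cond):
    """Mean squared error over the non-conditioning tokens only."""
    keep = ~cond
    return ((pred - target) ** 2 * keep).sum() / (keep.sum() * pred.shape[1])


# Three segments of three tokens each; the last one is to be generated.
rng = np.random.default_rng(0)
clean = rng.standard_normal((9, 4))
noise = rng.standard_normal((9, 4))
model_input, cond = apply_segment_condition(
    clean, noise, sigma=0.5, n_scenes=3, n_condition=2
)
```

Under these assumptions, generating an additional scene would amount to repeating the step with the most recent segments as the conditioning prefix, which is one way to read the auto-regressive scene extension described above.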

Paper Structure

This paper contains 15 sections, 3 equations, 17 figures, 10 tables.

Figures (17)

  • Figure 1: The overall pipeline of Mask$^2$DiT. First, we concatenate the text and video token sequences of $n$ scenes in temporal order, where $n=3$ is a fixed constant. The text tokens (indices 0-5) are placed at the beginning, followed by the video tokens (indices 6-E). Then, we introduce a symmetric binary attention mask so that each text annotation affects only its corresponding video segment, while temporal continuity is preserved across all visual tokens. Finally, we introduce a segment-level conditional mask that predicts the final segment from the preceding $n-1$ video segments, which lets the model extend scenes auto-regressively.
  • Figure 2: Attention mask variants explored. The first configuration serves as the baseline. The second is the most computationally efficient and balances semantic consistency (each video segment is aligned with its corresponding text) against visual consistency across all segments. The three rightmost configurations are explored to capture additional inter-token correlations.
  • Figure 3: Illustration of the grouped attention mechanism for Text Token Group 1 and Scene Token Group 1. For Text Token Group 1, the Query consists of text tokens from the first scene segment, while the Key and Value are formed by concatenating text and video tokens of the same scene segment. For Scene Token Group 1, the Query consists of video tokens from the first scene segment, and the Key and Value are obtained by concatenating text tokens of the first scene segment with the video tokens from all scenes.
  • Figure 4: Illustration of the supervised fine-tuning dataset construction process. Long-form videos are segmented into 10-minute clips, then divided into individual shots using PySceneDetect [pyscenedetect]. Shot combinations are filtered based on inter-video relevance, computed using ViClip [wang2023internvid]. Descriptive annotations that preserve character and background consistency are generated using Gemini [team2023gemini].
  • Figure 5: Structure of the evaluation dataset. It contains 50 scenes, each accompanied by three prompts featuring a variable number of characters. ChatGPT [openai2024chatgpt] was used to generate diverse prompts based on this structure, enabling comprehensive evaluation of video generation models in multi-scene scenarios.
  • ...and 12 more figures
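The attention pattern described in the Figure 1 and Figure 3 captions can be sketched in a few lines of NumPy. This is an illustrative reconstruction from the captions, not the authors' code. It assumes the index range "6-E" is hexadecimal (so 9 video tokens follow 6 text tokens), that tokens are split evenly across the $n=3$ scenes (2 text and 3 video tokens per scene), and it uses hypothetical helper names (`build_symmetric_mask`, `softmax_attention`, `grouped_attention`). The sketch builds the symmetric binary mask and then computes the same result group by group, as in the Figure 3 description.

```python
import numpy as np


def build_symmetric_mask(n_scenes, text_len, video_len):
    """Boolean (L, L) mask; True means the query row may attend to the key column.

    Assumed layout (from the Figure 1 caption): all text tokens first,
    scene by scene, followed by all video tokens, scene by scene.
    """
    n_text = n_scenes * text_len
    total = n_text + n_scenes * video_len
    # Scene id of every token in the concatenated sequence.
    scene = np.concatenate([
        np.repeat(np.arange(n_scenes), text_len),
        np.repeat(np.arange(n_scenes), video_len),
    ])
    is_video = np.arange(total) >= n_text
    same_scene = scene[:, None] == scene[None, :]
    both_video = is_video[:, None] & is_video[None, :]
    # Text sees only its own scene; video sees its own text plus all video.
    return same_scene | both_video


def softmax_attention(q, k, v, mask=None):
    """Single-head scaled dot-product attention with an optional boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def grouped_attention(q, k, v, n_scenes, text_len, video_len):
    """Same result as the masked version, computed one token group at a time."""
    n_text = n_scenes * text_len
    all_video = np.arange(n_text, n_text + n_scenes * video_len)
    out = np.empty((q.shape[0], v.shape[1]))
    for i in range(n_scenes):
        text_idx = np.arange(i * text_len, (i + 1) * text_len)
        video_idx = n_text + np.arange(i * video_len, (i + 1) * video_len)
        # Text group i: keys/values are the text and video tokens of scene i.
        kv = np.concatenate([text_idx, video_idx])
        out[text_idx] = softmax_attention(q[text_idx], k[kv], v[kv])
        # Scene group i: keys/values are the text of scene i plus all video tokens.
        kv = np.concatenate([text_idx, all_video])
        out[video_idx] = softmax_attention(q[video_idx], k[kv], v[kv])
    return out


n_scenes, text_len, video_len = 3, 2, 3
mask = build_symmetric_mask(n_scenes, text_len, video_len)
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((mask.shape[0], 8)) for _ in range(3))
dense = softmax_attention(q, k, v, mask)
grouped = grouped_attention(q, k, v, n_scenes, text_len, video_len)
```

Because attention is invariant to the order of keys, `dense` and `grouped` should agree up to floating-point error. The grouped form skips the masked-out text/video pairs entirely instead of computing and discarding them, which is one plausible reading of why the captions describe this configuration as computationally efficient.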