LayerT2V: A Unified Multi-Layer Video Generation Framework

Guangzhao Li, Kangrui Cen, Baixuan Zhao, Yi Xin, Siqi Luo, Guangtao Zhai, Lei Zhang, Xiaohong Liu

TL;DR

A unified multi-layer video generation framework that produces multiple semantically consistent outputs in a single inference pass: the full video, an independent background layer, and multiple foreground RGB layers with corresponding alpha mattes.

Abstract

Text-to-video generation has advanced rapidly, but existing methods typically output only the final composited video and lack editable layered representations, limiting their use in professional workflows. We propose LayerT2V, a unified multi-layer video generation framework that produces multiple semantically consistent outputs in a single inference pass: the full video, an independent background layer, and multiple foreground RGB layers with corresponding alpha mattes. Our key insight is that recent video generation backbones use high compression in both time and space, enabling us to serialize multiple layer representations along the temporal dimension and jointly model them on a shared generation trajectory. This turns cross-layer consistency into an intrinsic objective, improving semantic alignment and temporal coherence. To mitigate layer ambiguity and conditional leakage, we augment a shared DiT backbone with LayerAdaLN and layer-aware cross-attention modulation. LayerT2V is trained in three stages: alpha mask VAE adaptation, joint multi-layer learning, and multi-foreground extension. We also introduce VidLayer, the first large-scale dataset for multi-layer video generation. Extensive experiments demonstrate that LayerT2V substantially outperforms prior methods in visual fidelity, temporal consistency, and cross-layer coherence.
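
As a rough illustration of serializing layers along the temporal dimension, the sketch below concatenates per-layer latents along the time axis and tracks which layer each latent frame belongs to. It is a minimal sketch under stated assumptions, not the paper's implementation: the latent shape (T, C, H, W), the layer order, and the helper names `serialize_layers` and `split_layers` are hypothetical, and random arrays stand in for VAE latents.

```python
import numpy as np

def serialize_layers(layers):
    """Concatenate per-layer latents of shape (T, C, H, W) along the time axis.

    Returns the joint sequence, a per-frame layer index, and the layer order.
    The layer order and latent layout are assumptions made for illustration.
    """
    names = list(layers)
    seq = np.concatenate([layers[n] for n in names], axis=0)
    layer_ids = np.concatenate(
        [np.full(layers[n].shape[0], i) for i, n in enumerate(names)]
    )
    return seq, layer_ids, names

def split_layers(seq, layer_ids, names):
    """Invert serialize_layers: recover each layer from the joint sequence."""
    return {n: seq[layer_ids == i] for i, n in enumerate(names)}

# Toy latents: 4 latent frames, 8 channels, 6x6 spatial grid per layer.
rng = np.random.default_rng(0)
T, C, H, W = 4, 8, 6, 6
layers = {
    name: rng.standard_normal((T, C, H, W))
    for name in ("full", "background", "foreground", "alpha")
}

seq, layer_ids, names = serialize_layers(layers)
recovered = split_layers(seq, layer_ids, names)
print(seq.shape)  # joint sequence is 4 layers x 4 frames long on the time axis
```

A per-frame layer index of this kind is one plausible input to a layer-identity mechanism such as LayerAdaLN, although the abstract does not specify how layer identity is encoded.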

Paper Structure

This paper contains 46 sections, 12 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Training pipeline and architecture of LayerT2V. (a) Stage 1: Mask VAE adaptation, where the pretrained VAE encoder is frozen and a lightweight projection plus VAE decoder is trained to reconstruct high-quality alpha mattes. (b) Stage 2: Multi-layer generation with a DiT backbone that jointly models text tokens, video tokens, and mask tokens to generate Full Video, Background, Foreground, and Alpha Mask. LayerAdaLN injects layer identity into the timestep modulation, and layer-aware cross-attention conditions each layer on its corresponding text prompt to improve layer separation and cross-layer coherence.
  • Figure 2: Data construction pipeline of VidLayer dataset.
  • Figure 3: Left: Visualization samples from the VidLayer dataset, which comprises layered content and corresponding layered prompts. Right: Scene classification of VidLayer and semantic redundancy of the dataset prompts. To measure semantic redundancy in the text prompts, we extract text embeddings using CLIP and set a cosine similarity threshold of 0.85 to identify duplicates.
  • Figure 4: Qualitative results. LayerT2V generates high-fidelity multi-layer videos across three generation modes: (a) single-foreground with a single subject, (b) single-foreground with multiple subjects, and (c) multi-foreground joint generation with independent layers. Our method produces clean foreground separation, sharp alpha mattes, and complete backgrounds without leakage or boundary artifacts across diverse scenes and motion patterns.
  • Figure 5: Qualitative comparison. Compared to LayerFlow, LayerT2V produces higher-quality video layers with stronger temporal consistency and better text alignment. BL/BG/FG correspond to the full-video/background/foreground prompts. Note that LayerFlow outputs the foreground as RGB (without alpha), as its authors claim that RGB foregrounds achieve higher visual quality than RGBA.
  • ...and 10 more figures
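
The prompt redundancy check mentioned in the Figure 3 caption (cosine similarity between CLIP text embeddings, threshold 0.85) can be sketched as follows. This is a hedged illustration only: the greedy pass, the use of `>=` at the threshold, and the helper names `cosine` and `find_duplicates` are assumptions, and short toy vectors stand in for real CLIP embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def find_duplicates(embeddings, threshold=0.85):
    """Greedy pass: a prompt counts as a duplicate if its similarity to any
    previously kept prompt reaches the threshold (an assumed procedure)."""
    kept, duplicates = [], []
    for i, emb in enumerate(embeddings):
        if any(cosine(emb, embeddings[j]) >= threshold for j in kept):
            duplicates.append(i)
        else:
            kept.append(i)
    return kept, duplicates

# Toy stand-ins for CLIP text embeddings (real ones are much higher-dimensional).
toy = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],  # nearly parallel to prompt 0
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.3],  # nearly parallel to prompt 2
]

kept, duplicates = find_duplicates(toy)
redundancy = len(duplicates) / len(toy)
print(kept, duplicates, redundancy)
```

The fraction of prompts flagged as duplicates gives one possible redundancy figure for a prompt set; the caption does not state how the reported statistic is aggregated.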