CineTrans: Learning to Generate Videos with Cinematic Transitions via Masked Diffusion Models

Xiaoxue Wu, Bingjie Gao, Yu Qiao, Yaohui Wang, Xinyuan Chen

TL;DR

CineTrans tackles the challenge of generating coherent multi-shot videos with film-style transitions by uncovering a correspondence between diffusion-model attention and shot boundaries, then imposing a mask-based control to enforce cinematic transitions. A dedicated Cine250K dataset with frame-level shot labels and hierarchical captions supports training and evaluation for film-editing-style generation. The method combines attention-analysis-driven masking with training-time fine-tuning or training-free variants (and LoRA customization) to achieve precise transitions, strong inter- and intra-shot consistency, and high aesthetic quality. Comprehensive metrics and user studies demonstrate substantial improvements over baselines, highlighting the viability of diffusion-based, controllable multi-shot video synthesis.

Abstract

Despite significant advances in video synthesis, research into multi-shot video generation remains in its infancy. Even with scaled-up models and massive datasets, the shot transition capabilities remain rudimentary and unstable, largely confining generated videos to single-shot sequences. In this work, we introduce CineTrans, a novel framework for generating coherent multi-shot videos with cinematic, film-style transitions. To facilitate insights into the film editing style, we construct a multi-shot video-text dataset Cine250K with detailed shot annotations. Furthermore, our analysis of existing video diffusion models uncovers a correspondence between attention maps in the diffusion model and shot boundaries, which we leverage to design a mask-based control mechanism that enables transitions at arbitrary positions and transfers effectively in a training-free setting. After fine-tuning on our dataset with the mask mechanism, CineTrans produces cinematic multi-shot sequences while adhering to the film editing style, avoiding unstable transitions or naive concatenations. Finally, we propose specialized evaluation metrics for transition control, temporal consistency and overall quality, and demonstrate through extensive experiments that CineTrans significantly outperforms existing baselines across all criteria.

Paper Structure

This paper contains 30 sections, 9 equations, 20 figures, 7 tables.

Figures (20)

  • Figure 1: Multi-shot videos generated by CineTrans, which enables cinematic transitions that align with film editing. The corresponding mask is constructed from the timestamps of the shots and controls where the cinematic transitions occur. Project page: https://uknowsth.github.io/CineTrans/
  • Figure 2: Overview of CineTrans. Existing video generation models focus primarily on single-shot video. When they do produce multi-shot videos, the results often exhibit several failure modes and remain unstable and uncontrollable. Observations of these multi-shot cases reveal a structured pattern in attention layers. Based on this insight, we introduce a mask mechanism and fine-tune the model with our constructed dataset Cine250K, resulting in significantly improved performance.
  • Figure 3: Dataset curation pipeline. The raw video is split into several clips and then selectively stitched based on semantic features. A selection process then chooses high‑quality multi‑shot videos. After initial assembly, gradual changes are removed. Finally, a language model is used to annotate each video with a general caption and each shot with its shot caption, yielding temporally dense annotations.
  • Figure 4: We observe that in multi‑shot scenarios the attention maps form a block‑diagonal pattern, i.e., certain layers exhibit higher intra‑shot than inter‑shot frame correlations, so we design a corresponding masking mechanism. Using predefined transition points, the mask is applied to those layers of the diffusion model to guide cinematic multi‑shot video generation.
  • Figure 5: Qualitative results for different methods. Our proposed CineTrans outperforms others in transition control while preserving coherence between shots, aligning with film-editing styles. The figure illustrates the shot segmentation results and specified shot count.
  • ...and 15 more figures
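
The block-diagonal mask described in the Figure 4 caption can be illustrated with a short sketch. This is not the authors' implementation. It only shows one plausible way to turn predefined transition points into a mask where tokens attend within their own shot and not across shots. The function names (`shot_ids`, `block_diagonal_mask`), the `tokens_per_frame` parameter, and the convention that a cut point is the index of the first frame of a new shot are all assumptions made for illustration. Which layers receive the mask, and how it enters the attention computation, is not specified in this excerpt and is left out.

```python
from typing import List, Sequence


def shot_ids(num_frames: int, cut_points: Sequence[int]) -> List[int]:
    """Label every frame with the index of the shot it belongs to.

    Assumption: each cut point is the index of the first frame of a new shot,
    so it must lie strictly between 0 and num_frames.
    """
    cuts = sorted(set(cut_points))
    if any(c <= 0 or c >= num_frames for c in cuts):
        raise ValueError("cut points must lie strictly inside the clip")
    ids: List[int] = []
    shot = 0
    for frame in range(num_frames):
        # Advance to the next shot once the frame index reaches its cut point.
        while shot < len(cuts) and frame >= cuts[shot]:
            shot += 1
        ids.append(shot)
    return ids


def block_diagonal_mask(
    num_frames: int,
    cut_points: Sequence[int],
    tokens_per_frame: int = 1,
) -> List[List[bool]]:
    """Return a square boolean mask; True means attention is allowed.

    Tokens are assumed to be laid out frame by frame, with tokens_per_frame
    tokens for each frame (a hypothetical layout chosen for this sketch).
    """
    ids = shot_ids(num_frames, cut_points)
    token_ids = [s for s in ids for _ in range(tokens_per_frame)]
    # Two tokens may attend to each other only if they share a shot index.
    return [[a == b for b in token_ids] for a in token_ids]


if __name__ == "__main__":
    # Six frames, new shots starting at frames 2 and 4 (three shots of two frames).
    for row in block_diagonal_mask(6, [2, 4]):
        print("".join("1" if allowed else "0" for allowed in row))
```

Run as a script, the example prints a 6x6 grid with three 2x2 blocks of ones along the diagonal. In an attention layer, a mask like this would typically be converted to an additive bias (zero where allowed, a large negative value where blocked) before the softmax; whether CineTrans uses a hard mask of this kind or a softer variant is not stated in the captions above.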