Cut2Next: Generating Next Shot via In-Context Tuning

Jingwen He, Hongbo Liu, Jiajun Li, Ziqi Huang, Yu Qiao, Wanli Ouyang, Ziwei Liu

TL;DR

This work defines Next Shot Generation (NSG) and presents Cut2Next, a Diffusion Transformer-based framework that generates a next shot conforming to professional editing patterns and strict cinematic continuity. It introduces a two-stage data regime (the large-scale RawCuts and the refined CuratedCuts) and a Hierarchical Multi-Prompting scheme (Relational and Individual Prompts). Context-Aware Condition Injection (CACI) and a Hierarchical Attention Mask (HAM) integrate these prompt signals without introducing new parameters. Evaluations on CutBench show Cut2Next achieves superior visual consistency and text fidelity relative to a strong baseline, and human studies indicate a strong preference for its adherence to editing patterns and cinematic continuity. The approach advances narrative video generation by balancing shot diversity with narrative continuity, offering practical potential for film-like content creation and editing-aware synthesis.

Abstract

Effective multi-shot generation demands purposeful, film-like transitions and strict cinematic continuity. Current methods, however, often prioritize basic visual consistency, neglecting crucial editing patterns (e.g., shot/reverse shot, cutaways) that drive narrative flow for compelling storytelling. This yields outputs that may be visually coherent but lack narrative sophistication and true cinematic integrity. To bridge this, we introduce Next Shot Generation (NSG): synthesizing a subsequent, high-quality shot that critically conforms to professional editing patterns while upholding rigorous cinematic continuity. Our framework, Cut2Next, leverages a Diffusion Transformer (DiT). It employs in-context tuning guided by a novel Hierarchical Multi-Prompting strategy. This strategy uses Relational Prompts to define overall context and inter-shot editing styles. Individual Prompts then specify per-shot content and cinematographic attributes. Together, these guide Cut2Next to generate cinematically appropriate next shots. Architectural innovations, Context-Aware Condition Injection (CACI) and Hierarchical Attention Mask (HAM), further integrate these diverse signals without introducing new parameters. We construct RawCuts (large-scale) and CuratedCuts (refined) datasets, both with hierarchical prompts, and introduce CutBench for evaluation. Experiments show Cut2Next excels in visual consistency and text fidelity. Crucially, user studies reveal a strong preference for Cut2Next, particularly for its adherence to intended editing patterns and overall cinematic continuity, validating its ability to generate high-quality, narratively expressive, and cinematically coherent subsequent shots.
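
To make the two prompt levels described in the abstract concrete, the following is a minimal sketch of how one annotated shot pair might be represented. The class name, field names, and example texts are all hypothetical and chosen for illustration; the source does not specify a data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HierarchicalPrompt:
    """Hypothetical container for one shot pair (names are not from the paper)."""

    relational: str       # overall context and inter-shot editing pattern
    individual_cond: str  # content and cinematography of the conditional (input) shot
    individual_tgt: str   # content and cinematography of the target (next) shot

    def as_segments(self):
        """Return (segment_name, text) pairs in a fixed order."""
        return [
            ("rel", self.relational),
            ("ind_cond", self.individual_cond),
            ("ind_tgt", self.individual_tgt),
        ]


# Invented example texts, for illustration only.
example = HierarchicalPrompt(
    relational="Shot/reverse shot in a dialogue scene; same location and lighting.",
    individual_cond="Medium close-up of a woman speaking, over the man's shoulder.",
    individual_tgt="Medium close-up of the man listening, over the woman's shoulder.",
)

segment_names = [name for name, _ in example.as_segments()]
```

Keeping the relational prompt separate from the two individual prompts mirrors the division of labor the abstract describes: one prompt for the relationship between shots, and one per shot for its own content.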

Paper Structure

This paper contains 18 sections, 1 equation, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Cut2Next demonstrating versatile Next Shot Generation. The model produces cinematically coherent subsequent shots (bottom) adhering to diverse editing patterns (e.g., Shot/Reverse Shot, Cut-Out, Cutaway) specified alongside the input shots (upper).
  • Figure 2: Canonical cut-driven shot sequences (from CuratedCuts), their narrative functions, and the generation difficulties. Cut-In/Cut-Out: Emphasizes details or shifts focus; challenges models with drastic scale changes while maintaining subject consistency. Cutaway: Provides external or subjective context; demands generating novel yet semantically related content. Shot/Reverse Shot: Facilitates dialogue and reveals reactions; requires consistent character appearance and spatial logic across alternating viewpoints. Multi-Angle: Offers varied viewpoints; requires consistent rendering across significant visual transformations.
  • Figure 3: The data construction pipeline for RawCuts and CuratedCuts.
  • Figure 4: Example of annotating one shot image pair by our Hierarchical Prompt Annotation.
  • Figure 5: Architecture of Cut2Next. Individual prompts ($P^{ind}_{cond}, P^{ind}_{tgt}$) and a relational prompt ($P^{rel}$) are converted to textual embeddings by a shared text encoder. The conditional shot $S_{cond}$ is encoded by a VAE into clean latents, while the target shot $S_{tgt}$ is encoded and noised for training. These textual and visual tokens form the input to the Cut2Next (DiT-based) model. Our Context-Aware Condition Injection (CACI) module (center right) applies distinct conditioning to AdaLN layers based on token type. The Hierarchical Attention Mask (HAM) (far right) further refines information flow by defining specific attention patterns between different token segments.
  • ...and 5 more figures
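
The Figure 5 caption describes HAM as defining attention patterns between token segments. The sketch below shows one way such a segment-level block mask could be built in plain Python. The segment names, token counts, and especially the set of allowed segment pairs are assumptions made for illustration; the caption does not state which segments may attend to which, so this is not the paper's actual mask.

```python
def build_block_mask(segment_lengths, allowed_pairs):
    """Build a boolean attention mask from ordered (name, length) segments.

    mask[q][k] is True when query token q may attend to key token k.
    allowed_pairs is a set of (query_segment, key_segment) names.
    """
    spans = {}
    start = 0
    for name, length in segment_lengths:
        spans[name] = (start, start + length)
        start += length
    total = start
    mask = [[False] * total for _ in range(total)]
    for q_name, k_name in allowed_pairs:
        q0, q1 = spans[q_name]
        k0, k1 = spans[k_name]
        for q in range(q0, q1):
            for k in range(k0, k1):
                mask[q][k] = True
    return mask, spans


# Hypothetical token layout: three prompt segments, then two shot segments.
SEGMENTS = [
    ("p_rel", 2), ("p_ind_cond", 2), ("p_ind_tgt", 2),
    ("s_cond", 3), ("s_tgt", 3),
]

# ASSUMED pattern (not taken from the source):
# every segment attends to itself, the two shots attend to each other,
# the relational prompt is linked to both shots, and each individual
# prompt is linked only to its own shot.
ALLOWED = {(name, name) for name, _ in SEGMENTS}
ALLOWED |= {("s_cond", "s_tgt"), ("s_tgt", "s_cond")}
for shot in ("s_cond", "s_tgt"):
    ALLOWED |= {("p_rel", shot), (shot, "p_rel")}
ALLOWED |= {
    ("p_ind_cond", "s_cond"), ("s_cond", "p_ind_cond"),
    ("p_ind_tgt", "s_tgt"), ("s_tgt", "p_ind_tgt"),
}

mask, spans = build_block_mask(SEGMENTS, ALLOWED)
```

Under this assumed pattern, a token of the conditional shot's individual prompt can attend to conditional-shot tokens but not to target-shot tokens, which is one plausible reading of "hierarchical" routing; a different choice of allowed pairs only requires changing `ALLOWED`.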