Cut2Next: Generating Next Shot via In-Context Tuning

Jingwen He; Hongbo Liu; Jiajun Li; Ziqi Huang; Yu Qiao; Wanli Ouyang; Ziwei Liu

TL;DR

This work defines the Next Shot Generation (NSG) task and presents Cut2Next, a Diffusion Transformer (DiT)-based framework that generates a subsequent shot conforming to professional editing patterns while preserving strict cinematic continuity. It introduces a two-stage data regime built on RawCuts (large-scale) and CuratedCuts (refined), and a Hierarchical Multi-Prompting scheme that pairs Relational Prompts with Individual Prompts. Two architectural components, Context-Aware Condition Injection (CACI) and Hierarchical Attention Mask (HAM), integrate these signals without introducing new parameters. On CutBench, Cut2Next achieves better visual consistency and text fidelity than a strong baseline, and human studies show a strong preference for its adherence to editing patterns and cinematic continuity. The approach advances narrative video generation by balancing shot diversity with narrative continuity, with practical potential for film-like content creation and editing-aware synthesis.

Abstract

Effective multi-shot generation demands purposeful, film-like transitions and strict cinematic continuity. Current methods, however, often prioritize basic visual consistency, neglecting crucial editing patterns (e.g., shot/reverse shot, cutaways) that drive narrative flow for compelling storytelling. This yields outputs that may be visually coherent but lack narrative sophistication and true cinematic integrity. To bridge this gap, we introduce Next Shot Generation (NSG): synthesizing a subsequent, high-quality shot that critically conforms to professional editing patterns while upholding rigorous cinematic continuity. Our framework, Cut2Next, leverages a Diffusion Transformer (DiT). It employs in-context tuning guided by a novel Hierarchical Multi-Prompting strategy. This strategy uses Relational Prompts to define overall context and inter-shot editing styles. Individual Prompts then specify per-shot content and cinematographic attributes. Together, these guide Cut2Next to generate cinematically appropriate next shots. Two architectural innovations, Context-Aware Condition Injection (CACI) and Hierarchical Attention Mask (HAM), further integrate these diverse signals without introducing new parameters. We construct the RawCuts (large-scale) and CuratedCuts (refined) datasets, both annotated with hierarchical prompts, and introduce CutBench for evaluation. Experiments show Cut2Next excels in visual consistency and text fidelity. Crucially, user studies reveal a strong preference for Cut2Next, particularly for its adherence to intended editing patterns and overall cinematic continuity, validating its ability to generate high-quality, narratively expressive, and cinematically coherent subsequent shots.
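
As an illustration of the Hierarchical Multi-Prompting idea described above, the following is a minimal sketch of how one Relational Prompt and several per-shot Individual Prompts might be organized. All class names, field names, the `flatten` helper, and the example text are hypothetical and chosen for illustration only; the abstract does not specify the actual prompt format used by Cut2Next.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RelationalPrompt:
    """Shared, sequence-level text (hypothetical fields)."""
    context: str          # overall scene context
    editing_pattern: str  # inter-shot editing style, e.g. "shot/reverse shot"


@dataclass
class IndividualPrompt:
    """Per-shot text (hypothetical fields)."""
    content: str          # what the shot depicts
    cinematography: str   # cinematographic attributes, e.g. shot size


@dataclass
class HierarchicalPrompt:
    relational: RelationalPrompt
    shots: List[IndividualPrompt] = field(default_factory=list)

    def flatten(self) -> List[str]:
        """Return one relational string followed by one string per shot."""
        texts = [
            f"{self.relational.context} "
            f"Editing pattern: {self.relational.editing_pattern}."
        ]
        for index, shot in enumerate(self.shots, start=1):
            texts.append(f"Shot {index}: {shot.content} ({shot.cinematography})")
        return texts


# Illustrative example only; the text is invented, not taken from the datasets.
example = HierarchicalPrompt(
    relational=RelationalPrompt(
        context="Two people talk across a kitchen table.",
        editing_pattern="shot/reverse shot",
    ),
    shots=[
        IndividualPrompt("A woman speaks, facing right.", "medium close-up"),
        IndividualPrompt("A man listens, facing left.", "medium close-up"),
    ],
)
```

Calling `example.flatten()` yields the relational text first and then one entry per shot, which mirrors the split between inter-shot and per-shot conditioning. How these texts are actually encoded and injected (CACI) or masked (HAM) is not covered by this sketch.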
