MatchDiffusion: Training-free Generation of Match-cuts
Alejandro Pardo, Fabio Pizzati, Tong Zhang, Alexander Pondaven, Philip Torr, Juan Camilo Perez, Bernard Ghanem
TL;DR
MatchDiffusion generates match-cuts without training a new model by splitting the diffusion process into two stages. Joint Diffusion first builds a shared structural foundation for two prompts; Disjoint Diffusion then lets the two videos diverge semantically while preserving coherence. The work formalizes match-cut generation as paired video synthesis, proposes baselines and evaluation metrics, and reports qualitative and quantitative results showing strong prompt adherence and motion consistency. The workflow is training-free, supports optional user interventions, and is tunable through the shared-structure parameter $K$. The approach could help democratize match-cut creation and may extend to other training-free video synthesis tasks, with further refinement possible through prompting and conditioning.
Abstract
Match-cuts are powerful cinematic tools that create seamless transitions between scenes, delivering strong visual and metaphorical connections. However, crafting match-cuts is a challenging, resource-intensive process requiring deliberate artistic planning. In MatchDiffusion, we present the first training-free method for match-cut generation using text-to-video diffusion models. MatchDiffusion leverages a key property of diffusion models: early denoising steps define the scene's broad structure, while later steps add details. Guided by this insight, MatchDiffusion employs "Joint Diffusion" to initialize generation for two prompts from shared noise, aligning structure and motion. It then applies "Disjoint Diffusion", allowing the videos to diverge and introduce unique details. This approach produces visually coherent videos suited for match-cuts. User studies and metrics demonstrate MatchDiffusion's effectiveness and potential to democratize match-cut creation.
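The two-stage procedure can be illustrated with a toy sketch. This is not the authors' implementation: `toy_denoiser`, the simplified update rule, and the choice to average the two prompt-conditioned predictions during the joint phase are all assumptions made for illustration, and a real pipeline would use a text-to-video diffusion model with its own scheduler. The sketch only shows the control flow: both prompts share one latent for the first `k` steps, then each branch continues on its own copy.

```python
import numpy as np


def toy_denoiser(latent, prompt_vec, t):
    """Hypothetical stand-in for a prompt-conditioned noise predictor.

    It nudges the latent toward `prompt_vec`; a real model would be a
    text-to-video diffusion network. `t` is unused in this toy version.
    """
    return 0.1 * (latent - prompt_vec)


def match_diffusion_sketch(prompt_a, prompt_b, shape, num_steps=50, k=20,
                           seed=0, denoiser=toy_denoiser):
    """Toy two-stage sampling: joint for the first k steps, disjoint after.

    Returns the two final latents and the shared latent at the split point.
    """
    if not 0 <= k <= num_steps:
        raise ValueError("k must satisfy 0 <= k <= num_steps")

    rng = np.random.default_rng(seed)
    z = rng.standard_normal(shape)  # one noise sample shared by both prompts

    # Joint phase (assumed form): a single latent is updated with the
    # average of the two prompt-conditioned predictions.
    for t in range(k):
        eps_a = denoiser(z, prompt_a, t)
        eps_b = denoiser(z, prompt_b, t)
        z = z - 0.5 * (eps_a + eps_b)

    shared = z.copy()

    # Disjoint phase: each prompt continues from the shared latent alone.
    z_a, z_b = shared.copy(), shared.copy()
    for t in range(k, num_steps):
        z_a = z_a - denoiser(z_a, prompt_a, t)
        z_b = z_b - denoiser(z_b, prompt_b, t)

    return z_a, z_b, shared


if __name__ == "__main__":
    shape = (4, 8, 8)  # toy "video" latent: frames x height x width
    prompt_a = np.ones(shape)    # hypothetical embedding for prompt A
    prompt_b = -np.ones(shape)   # hypothetical embedding for prompt B
    for k in (0, 20, 40):
        z_a, z_b, _ = match_diffusion_sketch(prompt_a, prompt_b, shape, k=k)
        print(k, float(np.linalg.norm(z_a - z_b)))
```

In this toy setting, a larger `k` leaves fewer independent steps, so the two outputs end up closer together. That mirrors the role of the shared-structure parameter $K$ as a trade-off between structural alignment and prompt-specific detail.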
