MatchDiffusion: Training-free Generation of Match-cuts

Alejandro Pardo, Fabio Pizzati, Tong Zhang, Alexander Pondaven, Philip Torr, Juan Camilo Perez, Bernard Ghanem

TL;DR

MatchDiffusion tackles the challenge of generating match-cuts without training a new model by introducing a two-stage diffusion process. It first performs Joint Diffusion to encode a shared structural foundation for two prompts, then uses Disjoint Diffusion to diverge semantically while preserving coherence. The work formalizes match-cut generation as paired video synthesis, proposes robust baselines and evaluation metrics, and demonstrates through qualitative and quantitative results that the method achieves strong prompt adherence and motion consistency, with a training-free workflow and optional user interventions. This approach has practical implications for democratizing match-cut creation and could extend to broader training-free video synthesis tasks, with tunable control via the shared-structure parameter $K$ and potential refinements through prompting and conditioning.

Abstract

Match-cuts are powerful cinematic tools that create seamless transitions between scenes, delivering strong visual and metaphorical connections. However, crafting match-cuts is a challenging, resource-intensive process requiring deliberate artistic planning. In MatchDiffusion, we present the first training-free method for match-cut generation using text-to-video diffusion models. MatchDiffusion leverages a key property of diffusion models: early denoising steps define the scene's broad structure, while later steps add details. Guided by this insight, MatchDiffusion employs "Joint Diffusion" to initialize generation for two prompts from shared noise, aligning structure and motion. It then applies "Disjoint Diffusion", allowing the videos to diverge and introduce unique details. This approach produces visually coherent videos suited for match-cuts. User studies and metrics demonstrate MatchDiffusion's effectiveness and potential to democratize match-cut creation.
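
The two-stage process described above can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_denoiser` is a hypothetical stand-in for a text-conditioned video diffusion model, the combining function `f` is assumed here to be a plain average of the two noise predictions, and the update `x - eps` is a deliberately simplified placeholder for a real sampler step. Only the control flow (shared noise, $K$ joint steps, then $T-K$ independent steps per prompt) follows the description.

```python
import zlib

import numpy as np


def toy_denoiser(x, t, prompt):
    """Hypothetical noise predictor; a real system would call a video diffusion model."""
    # Deterministic pseudo-noise that depends on the prompt and the step index.
    rng = np.random.default_rng([zlib.crc32(prompt.encode("utf-8")), t])
    return 0.1 * x + 0.01 * rng.standard_normal(x.shape)


def average(eps_a, eps_b):
    """Assumed choice for the combining function f: the mean of both predictions."""
    return 0.5 * (eps_a + eps_b)


def match_diffusion(shape, prompt_a, prompt_b, T=50, K=30,
                    f=average, denoiser=toy_denoiser, seed=0):
    """Return two samples that share the first K denoising steps."""
    if not 0 <= K <= T:
        raise ValueError("K must satisfy 0 <= K <= T")

    # Both paths start from the same noise sample.
    x = np.random.default_rng(seed).standard_normal(shape)

    # Joint Diffusion: one shared latent, driven by the combined prediction.
    for t in range(T, T - K, -1):
        eps = f(denoiser(x, t, prompt_a), denoiser(x, t, prompt_b))
        x = x - eps  # simplified update, not a real scheduler step

    # Disjoint Diffusion: split into two paths, one prompt per path.
    x_a, x_b = x.copy(), x.copy()
    for t in range(T - K, 0, -1):
        x_a = x_a - denoiser(x_a, t, prompt_a)
        x_b = x_b - denoiser(x_b, t, prompt_b)

    return x_a, x_b


if __name__ == "__main__":
    video_a, video_b = match_diffusion((8, 16, 16), "a bone", "a spaceship")
    print(video_a.shape, video_b.shape)
```

In this sketch, setting `K = T` keeps the two outputs identical, while `K = 0` lets them diverge from the first step, which mirrors the role of $K$ as the control over how much structure the pair shares.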

Paper Structure

This paper contains 24 sections, 5 equations, 17 figures, 1 table.

Figures (17)

  • Figure 1: Automatic match-cut generation with MatchDiffusion. In the history of cinema, there is prevalent use of match-cut transitions, i.e. semantic shifts in the content of two scenes that share the same structure, as exemplified by Stanley Kubrick's iconic transition from a bone to a spaceship (bottom left). However, obtaining visually appealing match-cuts requires sophisticated planning and multiple shots, due to the complexity of the transition. Our proposed MatchDiffusion approach is able to automatically generate match-cuts following textual prompts (right), thanks to a training-free inference technique composed of Joint and Disjoint Diffusion mechanisms (top left).
  • Figure 2: Feature emergence during denoising. While the first iterations (top) yield ambiguous outputs displaying colors and basic structure, further iterations inject semantics (middle), until the final output is generated (bottom).
  • Figure 3: MatchDiffusion. We formulate the task of creating match-cuts as generating a pair of videos that share a general appearance while differing in semantics. A portion of the frames of these videos can then be combined to enable match-cut transitions. To generate these videos, MatchDiffusion first performs a Joint Diffusion process for $K$ steps (left) by combining the noise predictions from the two prompts via a function $f$. Then, a Disjoint Diffusion process is executed to obtain the final outputs $x'$ and $x''$, i.e. denoising separately for the remaining $T-K$ iterations with one prompt per path. Optionally, MatchDiffusion also supports manual user intervention by allowing the integration of generated video tone and structural edits.
  • Figure 4: User intervention. For reproducing the match-cut in the teaser, we apply a background mask to the denoised output generated by joint diffusion. After the remaining denoising iterations, the output is refined to integrate the new background.
  • Figure 5: Generated match-cuts. MatchDiffusion can automatically synthesize match-cuts based on the prompts in green and red. Note how the cuts enjoy highly consistent appearance while preserving each prompt's semantics. Please see the supplementary for more samples.
  • ...and 12 more figures