
Audio Match Cutting: Finding and Creating Matching Audio Transitions in Movies and Videos

Dennis Fedorishin, Lie Lu, Srirangaraj Setlur, Venu Govindaraju

TL;DR

This work tackles automatic audio match cuts (fluid sound transitions that bridge two scenes) by proposing a self-supervised retrieval-and-transition framework. It frames audio match cutting as a unimodal audio retrieval task that uses maximum inner-product search between the query embedding $z_{V_q}$ and each gallery embedding $z_{G_i}$, followed by an adaptive audio blending operation. A novel Split-and-Contrast objective refines audio representations by aligning adjacent frames within the same video and contrasting non-adjacent ones, leveraging a frozen CLAP encoder with trainable projections and $p$-point temperature $\tau$. For transitions, the Max-Sub-Spectrogram method locates the optimal crossfade point in the spectrogram domain and applies an adaptive crossfade of length $l_{\text{crossfade}} = 1 / (\mathrm{Var}(\overline{M}) \phi)$, where $\overline{M} = (S_{Q_i}^T S_{M_j}) / (||S_{Q_i}||\,||S_{M_j}||)$ is the cosine similarity between query and match sub-spectrograms, yielding higher-quality matches. The approach is evaluated on AudioSet and Movieclips. The results show that deep representations (CLAP, ImageBind) outperform traditional features, and that the proposed Split-and-Contrast and adaptive transition strategies improve both retrieval and transition quality, providing a practical tool for editors in trailer and montage creation.
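The retrieval and transition steps summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of population variance for $\mathrm{Var}(\overline{M})$, the small `eps` guard against division by zero, and the example value of $\phi$ are all assumptions made for illustration, and plain Python lists stand in for real embeddings and spectrogram frames. The summary does not state the unit of $l_{\text{crossfade}}$, so the sketch returns the raw value.

```python
import math
from statistics import pvariance


def top_k_by_inner_product(query, gallery, k=1):
    """Indices of the k gallery embeddings with the largest inner product
    with the query embedding (brute-force maximum inner-product search)."""
    scores = [sum(q * g for q, g in zip(query, emb)) for emb in gallery]
    order = sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def similarity_matrix(query_frames, match_frames):
    """Entry [i][j] is the cosine similarity between query sub-spectrogram i
    and match sub-spectrogram j (each flattened to a vector)."""
    return [[cosine(q, m) for m in match_frames] for q in query_frames]


def best_transition(sim):
    """(i, j) position of the largest similarity: a candidate transition point."""
    cells = ((i, j) for i in range(len(sim)) for j in range(len(sim[0])))
    return max(cells, key=lambda ij: sim[ij[0]][ij[1]])


def adaptive_crossfade_length(sim, phi, eps=1e-8):
    """l_crossfade = 1 / (Var(M) * phi). Population variance and eps are
    assumptions of this sketch, not details taken from the paper."""
    flat = [x for row in sim for x in row]
    return 1.0 / (pvariance(flat) * phi + eps)


# Hypothetical toy data: 2-D embeddings and 2-D "sub-spectrograms".
candidates = top_k_by_inner_product([1.0, 0.0], [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]], k=2)
sim = similarity_matrix([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
point = best_transition(sim)
length = adaptive_crossfade_length(sim, phi=2.0)
```

Under this formula, a similarity matrix with high variance (sharp, isolated peaks) gives a short crossfade, while a smooth, low-variance matrix gives a longer one.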

Abstract

A "match cut" is a common video editing technique where a pair of shots that have a similar composition transition fluidly from one to another. Although match cuts are often visual, certain match cuts involve the fluid transition of audio, where sounds from different sources merge into one indistinguishable transition between two shots. In this paper, we explore the ability to automatically find and create "audio match cuts" within videos and movies. We create a self-supervised audio representation for audio match cutting and develop a coarse-to-fine audio match pipeline that recommends matching shots and creates the blended audio. We further annotate a dataset for the proposed audio match cut task and compare the ability of multiple audio representations to find audio match cut candidates. Finally, we evaluate multiple methods to blend two matching audio candidates with the goal of creating a smooth transition. Project page and examples are available at: https://denfed.github.io/audiomatchcut/

Paper Structure (11 sections, 2 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Example match cuts in movies. In 2001: A Space Odyssey (top), two different visuals transition fluidly based on the similar size and shape of the objects. In The Chronicles of Narnia: The Lion, the Witch and the Wardrobe (bottom), the sound of a sword clinking within its sheath is matched to the strike of a hammer in the next scene, creating a seamless audio match across scenes.
  • Figure 2: a) Proposed Framework. Given a query video, we retrieve an audio match cut candidate from a video gallery and find the optimal transition point using a sub-spectrogram similarity search. Using the variance of the created similarity matrix, we adaptively select the crossfade length to blend both the query and match audio into a fluid audio match cut. b) Proposed "Split-and-Contrast" contrastive objective. Each audio sample is split at a randomly selected frame, then the adjacent frames of the split are contrasted towards each other.
  • Figure 3: Example sub-spectrogram similarities of audio match cuts: A forging hammer striking matched with a knife chopping (left) exhibits high similarity on each strike occurrence. A blender matched with a motorcycle revving (right) shows a smoother similarity matrix, allowing for plausible transitions across multiple time steps.
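
The "Split-and-Contrast" objective described in Figure 2b can be sketched as a contrastive (InfoNCE-style) loss. This is a rough illustration under stated assumptions rather than the paper's exact formulation: the embeddings are assumed to be already produced by the frozen encoder and trainable projections, the negatives are taken to be the frames from other samples in the batch, the loss is computed in one direction only, and the function names and the value `tau=0.1` are hypothetical.

```python
import math
import random


def split_at_random_frame(frames, rng):
    """Split a frame sequence at a random interior index and return the two
    frames adjacent to the split (last frame before, first frame after)."""
    s = rng.randrange(1, len(frames))  # split index in 1 .. len(frames) - 1
    return frames[s - 1], frames[s]


def normalize(v):
    """Scale a vector to unit length (returned unchanged if its norm is zero)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def split_and_contrast_loss(lefts, rights, tau=0.1):
    """One-directional InfoNCE: lefts[i] should score highest against
    rights[i] (its adjacent frame) and lower against every other rights[j]."""
    lefts = [normalize(v) for v in lefts]
    rights = [normalize(v) for v in rights]
    total = 0.0
    for i, left in enumerate(lefts):
        logits = [sum(a * b for a, b in zip(left, r)) / tau for r in rights]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(lefts)


# Hypothetical toy batch: two samples whose adjacent frames are similar.
rng = random.Random(0)
before, after = split_at_random_frame([[0.0], [1.0], [2.0], [3.0]], rng)
aligned = split_and_contrast_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = split_and_contrast_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each frame is most similar to its own neighbor across the split (`aligned`) and large when the pairing is wrong (`shuffled`), which is the behavior a contrastive objective of this kind is meant to encourage.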