Audio Match Cutting: Finding and Creating Matching Audio Transitions in Movies and Videos
Dennis Fedorishin, Lie Lu, Srirangaraj Setlur, Venu Govindaraju
TL;DR
This work tackles the problem of automatic audio match cuts, fluid sound transitions that bridge two scenes, by proposing a self-supervised retrieval-and-transition framework. It frames audio match cutting as a unimodal audio retrieval task, solved by maximum inner-product search between a query embedding $z_{V_q}$ and candidate embeddings $z_{G_i}$, followed by an adaptive audio blending operation. A novel Split-and-Contrast objective refines audio representations by aligning adjacent frames within the same video and contrasting non-adjacent ones, leveraging a frozen CLAP encoder with trainable projections and $p$-point temperature $\tau$. For transitions, the Max-Sub-Spectrogram method locates optimal crossfade points in the spectrogram domain and uses adaptive crossfades of length $l_{\text{crossfade}} = 1 / (\mathrm{Var}(\overline{M})\,\phi)$, where $\overline{M} = (S_{Q_i}^T S_{M_j}) / (\lVert S_{Q_i} \rVert\,\lVert S_{M_j} \rVert)$ is a cosine similarity, yielding higher-quality matches. The approach is evaluated on AudioSet and Movieclips. Deep representations (CLAP, ImageBind) outperform traditional features, and the proposed Split-and-Contrast and adaptive transition strategies improve both retrieval and transition quality, providing a practical tool for editors in trailer and montage creation.
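The retrieval step and the crossfade-length formula above can be illustrated with a minimal sketch. It uses only the Python standard library and is not the authors' implementation. The function names, the representation of each sub-spectrogram as a flat list of floats, the use of population variance, the `eps` guard against division by zero, and the reading of the "optimal crossfade point" as the arg-max of $\overline{M}$ are all assumptions made for illustration. The units of $l_{\text{crossfade}}$ are not given in the summary, so the sketch returns the raw value.

```python
import math
from statistics import pvariance
from typing import List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length embeddings."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def retrieve_top_k(z_query: Sequence[float],
                   z_gallery: Sequence[Sequence[float]],
                   k: int = 5) -> List[Tuple[int, float]]:
    """Exhaustive maximum inner-product search: rank gallery items z_{G_i}
    by their inner product with the query z_{V_q}, highest first."""
    scores = [(i, inner_product(z_query, z_g)) for i, z_g in enumerate(z_gallery)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return inner_product(a, b) / (norm_a * norm_b)


def similarity_matrix(sub_q: Sequence[Sequence[float]],
                      sub_m: Sequence[Sequence[float]]) -> List[List[float]]:
    """Entry (i, j) is the cosine similarity between flattened
    sub-spectrograms S_{Q_i} and S_{M_j}."""
    return [[cosine(q, m) for m in sub_m] for q in sub_q]


def best_pair(sim: Sequence[Sequence[float]]) -> Tuple[int, int]:
    """Assumed transition point: the (i, j) with the highest similarity."""
    return max(((i, j) for i, row in enumerate(sim) for j in range(len(row))),
               key=lambda ij: sim[ij[0]][ij[1]])


def crossfade_length(sim: Sequence[Sequence[float]],
                     phi: float, eps: float = 1e-8) -> float:
    """l_crossfade = 1 / (Var(M) * phi). `eps` is an added safeguard
    (not in the formula) for a constant similarity matrix."""
    flat = [value for row in sim for value in row]
    return 1.0 / (pvariance(flat) * phi + eps)
```

Under this reading, a similarity matrix with low variance (no sub-spectrogram pair stands out) gives a long crossfade, while a sharply peaked matrix gives a short one. For large galleries, the exhaustive scan in `retrieve_top_k` would normally be replaced by an indexed nearest-neighbor search.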
Abstract
A "match cut" is a common video editing technique where a pair of shots that have a similar composition transition fluidly from one to another. Although match cuts are often visual, certain match cuts involve the fluid transition of audio, where sounds from different sources merge into one indistinguishable transition between two shots. In this paper, we explore the ability to automatically find and create "audio match cuts" within videos and movies. We create a self-supervised audio representation for audio match cutting and develop a coarse-to-fine audio match pipeline that recommends matching shots and creates the blended audio. We further annotate a dataset for the proposed audio match cut task and compare the ability of multiple audio representations to find audio match cut candidates. Finally, we evaluate multiple methods to blend two matching audio candidates with the goal of creating a smooth transition. Project page and examples are available at: https://denfed.github.io/audiomatchcut/
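As a concrete picture of what "creating the blended audio" can mean, the sketch below applies a plain linear crossfade to two mono waveforms stored as lists of samples. It is a generic baseline written for illustration, not one of the specific blending methods evaluated in the paper. The function name `linear_crossfade` and the choice of fade weights are assumptions.

```python
from typing import List, Sequence


def linear_crossfade(outgoing: Sequence[float],
                     incoming: Sequence[float],
                     fade_len: int) -> List[float]:
    """Overlap the last `fade_len` samples of `outgoing` with the first
    `fade_len` samples of `incoming`, ramping linearly between them."""
    if fade_len < 1 or fade_len > min(len(outgoing), len(incoming)):
        raise ValueError("fade_len must be between 1 and the shorter clip length")

    head = list(outgoing[:-fade_len])   # untouched part of the first clip
    tail = list(incoming[fade_len:])    # untouched part of the second clip
    start = len(outgoing) - fade_len    # index where the overlap begins

    mixed = []
    for n in range(fade_len):
        # Weight of the incoming clip rises strictly between 0 and 1.
        w = (n + 1) / (fade_len + 1)
        mixed.append((1.0 - w) * outgoing[start + n] + w * incoming[n])

    return head + mixed + tail
```

For example, `linear_crossfade([1, 1, 1, 1], [0, 0, 0, 0], 3)` produces five samples that step down from 1 to 0 through 0.75, 0.5, and 0.25. In practice the fade length would be expressed in samples at the clip's sample rate, and an equal-power curve is a common alternative when the two sounds are uncorrelated.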
