When and How to Cut Classical Concerts? A Multimodal Automated Video Editing Approach
Daniel Gonzálbez-Biosca, Josep Cabacas-Maso, Carles Ventura, Ismael Benito-Altamirano
TL;DR
The paper tackles automated editing of multicamera classical concerts by splitting the problem into when to cut and how to cut, and proposes a multimodal architecture that fuses log-mel audio features, temporal context $l_{\text{seg}}$, and optional visual embeddings. A pseudo-labeled dataset of 100 concert videos is built via a hybrid pipeline combining semantic cues (CLIP), thresholding, and confirmation via Gemini, enabling robust evaluation of both temporal segmentation and spatial shot selection. Quantitative results show the unimodal temporal model achieving up to $64.38\%$ validation accuracy and $62.01\%$ test accuracy (Recall up to $71.42\%$; F1 up to $66.03\%$), outperforming Poisson and exponential baselines, while CLIP-based spatial selection yields Recall@1 around $28.49\%$ and Recall@3 around $51.97\%$, surpassing ResNet-50 and Xception backbones. These findings establish the feasibility of multimodal automated editing in this domain and point to directions for finer-grained cut timing, semantic-enabled visual selection, and broader applicability beyond classical concerts.
Abstract
Automated video editing remains an underexplored task in the computer vision and multimedia domains, especially when contrasted with the growing interest in video generation and scene understanding. In this work, we address the specific challenge of editing multicamera recordings of classical music concerts by decomposing the problem into two key sub-tasks: when to cut and how to cut. Building on recent literature, we propose a novel multimodal architecture for the temporal segmentation task (when to cut), which integrates log-mel spectrograms of the audio signal, scalar temporal features, and an optional image embedding through a lightweight convolutional-transformer pipeline. For the spatial selection task (how to cut), we improve on prior work by replacing older backbones such as ResNet with a CLIP-based encoder and by constraining distractor selection to segments from the same concert. Our dataset was constructed following a pseudo-labeling approach, in which raw video data was automatically clustered into coherent shot segments. We show that our models outperform previous baselines in detecting cut points and provide competitive visual shot selection, advancing the state of the art in multimodal automated video editing.
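To make the spatial selection metric concrete, the following is a minimal sketch of how Recall@K could be computed when each query segment is scored against a pool of candidate shots, with the distractors assumed to come from the same concert as the ground-truth shot. The function name `recall_at_k`, the score layout, and the toy values are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Sequence


def recall_at_k(
    scores: Sequence[Sequence[float]],
    targets: Sequence[int],
    k: int,
) -> float:
    """Fraction of queries whose ground-truth candidate ranks in the top k.

    scores[i][j] is a (hypothetical) similarity between query i and
    candidate shot j; targets[i] is the index of the true next shot.
    Candidates for a query are assumed to be drawn from the same concert.
    """
    if len(scores) != len(targets):
        raise ValueError("scores and targets must have the same length")
    if k < 1:
        raise ValueError("k must be at least 1")
    if not scores:
        return 0.0

    hits = 0
    for cand_scores, target in zip(scores, targets):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(
            range(len(cand_scores)),
            key=lambda j: cand_scores[j],
            reverse=True,
        )
        if target in ranked[:k]:
            hits += 1
    return hits / len(scores)


# Toy example: three queries, three candidate shots each.
toy_scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.8, 0.5],
    [0.1, 0.4, 0.7],
]
toy_targets = [0, 2, 0]

r1 = recall_at_k(toy_scores, toy_targets, k=1)
r2 = recall_at_k(toy_scores, toy_targets, k=2)
r3 = recall_at_k(toy_scores, toy_targets, k=3)
```

Under this definition Recall@3 can never be lower than Recall@1, which matches the ordering of the reported figures. Because the candidate pool is restricted to one concert, the distractors are visually similar to the target, so the metric is harder than it would be with distractors sampled from unrelated videos.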
