When and How to Cut Classical Concerts? A Multimodal Automated Video Editing Approach

Daniel Gonzálbez-Biosca, Josep Cabacas-Maso, Carles Ventura, Ismael Benito-Altamirano

TL;DR

The paper tackles automated editing of multicamera classical concerts by splitting the problem into when to cut and how to cut, and proposes a multimodal architecture that fuses log-mel audio features, temporal context $l_{\text{seg}}$, and optional visual embeddings. A pseudo-labeled dataset of 100 concert videos is built via a hybrid pipeline combining semantic cues (CLIP), thresholding, and confirmation via Gemini, enabling robust evaluation of both temporal segmentation and spatial shot selection. Quantitative results show the unimodal temporal model achieving up to $64.38\%$ validation accuracy and $62.01\%$ test accuracy (Recall up to $71.42\%$; F1 up to $66.03\%$), outperforming Poisson and exponential baselines, while CLIP-based spatial selection yields Recall@1 around $28.49\%$ and Recall@3 around $51.97\%$, surpassing ResNet-50 and Xception backbones. These findings establish the feasibility of multimodal automated editing in this domain and point to directions for finer-grained cut timing, semantic-enabled visual selection, and broader applicability beyond classical concerts.
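To make the Recall@1 and Recall@3 figures concrete, the following is a minimal sketch of how such a ranking metric is commonly computed. It assumes each query has one ground-truth next shot ranked against a set of distractors; the function name `recall_at_k`, the toy scores, and the candidate counts are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Sequence


def recall_at_k(
    scores_per_query: Sequence[Sequence[float]],
    target_idx: Sequence[int],
    k: int,
) -> float:
    """Fraction of queries whose ground-truth candidate ranks in the top k.

    scores_per_query[q][i] is a (hypothetical) similarity score between
    query q and candidate shot i; target_idx[q] is the index of the true
    next shot among the candidates of query q.
    """
    hits = 0
    for scores, target in zip(scores_per_query, target_idx):
        # Rank candidate indices by descending score.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(scores_per_query)


# Toy example: three queries, four candidate shots each (made-up scores).
scores = [
    [0.9, 0.1, 0.3, 0.2],  # true shot (index 0) is ranked first
    [0.2, 0.8, 0.5, 0.1],  # true shot (index 2) is ranked second
    [0.4, 0.3, 0.2, 0.6],  # true shot (index 2) is ranked last
]
targets = [0, 2, 2]

r1 = recall_at_k(scores, targets, k=1)
r3 = recall_at_k(scores, targets, k=3)
```

In this toy case one of three queries is a hit at k=1 and two of three at k=3, which mirrors the way Recall@3 is always at least as large as Recall@1.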

Abstract

Automated video editing remains an underexplored task in the computer vision and multimedia domains, especially when contrasted with the growing interest in video generation and scene understanding. In this work, we address the specific challenge of editing multicamera recordings of classical music concerts by decomposing the problem into two key sub-tasks: when to cut and how to cut. Building on recent literature, we propose a novel multimodal architecture for the temporal segmentation task (when to cut), which integrates log-mel spectrograms from the audio signals, scalar temporal features, and an optional image embedding through a lightweight convolutional-transformer pipeline. For the spatial selection task (how to cut), we improve on prior work by replacing older backbones, e.g. ResNet, with a CLIP-based encoder and by constraining distractor selection to segments from the same concert. Our dataset was constructed following a pseudo-labeling approach, in which raw video data was automatically clustered into coherent shot segments. We show that our models outperform previous baselines in detecting cut points and provide competitive visual shot selection, advancing the state of the art in multimodal automated video editing.

Paper Structure

This paper contains 11 sections, 5 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of the editing task. Given a multicamera recording of a classical music concert, the system must decide when to cut (horizontal arrow) and how to cut (vertical arrow): that is, which shot transition to perform, and which view to show next.
  • Figure 2: Data collection pipeline for the classical concert video dataset. Videos were downloaded in nHD resolution (360p) using the yt-dlp downloader. The audio was resampled to 16 kHz using ffmpeg, and video frames were extracted at 5 FPS using OpenCV. This intermediate dataset consists of unlabeled multimodal data, including audio and visual components for each video.
  • Figure 3: Pipeline for shot boundary detection combining classical and semantic approaches. A raw video is first processed using the classical shot boundary detector scenedetect, which yields a set of potential cut locations. In parallel, CLIP embeddings are computed for each frame and compared using cosine similarity to estimate content changes between frames. The output from both streams is merged to confirm true shot transitions. Finally, the confirmed shots are passed through a Gemini 1.5 Flash-based model for downstream tasks such as captioning.
  • Figure 4: Overview of the temporal segmentation model. (a) Internal structure of a convolutional block. (b) Architecture of the full multimodal model, which processes audio, video, and time features into a single embedding. The audio input is a log-mel spectrogram, while the time input is a scalar value representing the time elapsed since the last cut. The visual input is an image embedding from a pretrained vision model. The final output is the probability of a scene cut occurring within the given segment. The unimodal model is a simplified version of this architecture; the dashed red rectangle indicates the components that are not present in the unimodal model.
  • Figure 5: ROC curves for the classification task. (a) Validation set: the unimodal and multimodal models showed clearly superior performance over the statistical baselines across all thresholds. (b) Test set: although performance slightly decreased, the models retained good ranking capabilities.
  • ...and 1 more figures
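
The merging step described in the Figure 3 caption can be sketched as follows. This is a hedged illustration only: it assumes per-frame embeddings are already available as plain float vectors (standing in for CLIP embeddings), and the similarity threshold of 0.8, the one-frame tolerance, and the rule "keep a classical candidate only if a semantic change lies nearby" are all assumptions made for the example rather than values or logic reported in the paper. The Gemini confirmation stage is omitted.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_cut_candidates(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.8
) -> List[int]:
    """Frame indices i whose embedding differs strongly from frame i - 1.

    `threshold` is an assumed value, not one taken from the paper.
    """
    return [
        i
        for i in range(1, len(embeddings))
        if cosine_similarity(embeddings[i - 1], embeddings[i]) < threshold
    ]


def confirm_cuts(
    classical: Sequence[int], semantic: Sequence[int], tolerance: int = 1
) -> List[int]:
    """Keep classical candidates that have a semantic candidate within
    `tolerance` frames (an assumed merge rule)."""
    return [c for c in classical if any(abs(c - s) <= tolerance for s in semantic)]


# Toy 2-D "embeddings": the content changes between frame 1 and frame 2.
frames = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99], [0.1, 1.0]]
classical_candidates = [2, 4]  # e.g. output of a classical detector (made up)

semantic = semantic_cut_candidates(frames)
confirmed = confirm_cuts(classical_candidates, semantic)
```

Here only the candidate at frame 2 is confirmed, because the candidate at frame 4 has no matching drop in embedding similarity. Requiring agreement between the two streams trades some recall for fewer false cuts, which is one plausible reason to combine a classical detector with semantic cues.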