Boundary Regression for Leitmotif Detection in Music Audio
Sihun Lee, Dasaem Jeong
TL;DR
This work reframes leitmotif detection in music audio as a boundary-regression problem rather than frame-wise event detection, adopting a YOLO-like CNN that predicts motif boundaries over a time grid. Audio is transformed with a constant-Q transform and processed in 15-second clips, producing an output tensor of shape $n \times 11 \times (3 + C)$ with $n = 3$ and $C = 13$. Each prediction includes $p$, $x$, and $w$, and the width is computed as $a_n \exp(w)$ using predetermined anchors $A = \{ a_1, \ldots, a_n \}$. Training uses a YOLOv3-style multi-part loss with pitch augmentation; anchors are selected by k-means on motif boundaries, and non-maximum suppression (NMS) is applied during evaluation. Results on the Wagner Ring dataset show that the approach provides competitive boundary detections while enabling exact motif boundary localization, though act-split generalization remains challenging. The work highlights the potential of precise, boundary-aware analysis of repetitive motifs in music and invites further exploration of boundary-based audio localization methods.
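To make the output format concrete, the following is a minimal sketch of how a tensor of shape $n \times 11 \times (3 + C)$ might be decoded into motif boundaries and filtered with one-dimensional NMS. Only the tensor shape, the 15-second clip length, and the width rule (anchor times $\exp(w)$) come from the summary above. Everything else is an assumption made for illustration: that $p$ is a confidence passed through a sigmoid, that $x$ is a sigmoid offset of the motif center within its grid cell (as in YOLO), that the last $C$ channels are class scores, and all function names, anchor values, and thresholds.

```python
import numpy as np

CLIP_SECONDS = 15.0  # clip length stated in the summary


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def decode(output, anchors, threshold=0.5):
    """Turn a raw (n, cells, 3 + C) output into (start, end, class, score) tuples.

    Assumed channel layout (hypothetical): [p, x, w, class scores...].
    `anchors` holds one width in seconds per anchor.
    """
    n, cells, _ = output.shape
    cell_len = CLIP_SECONDS / cells
    p = sigmoid(output[..., 0])            # assumed: confidence
    x = sigmoid(output[..., 1])            # assumed: center offset within the cell
    w = output[..., 2]
    labels = output[..., 3:].argmax(axis=-1)

    centers = (np.arange(cells)[None, :] + x) * cell_len
    widths = anchors[:, None] * np.exp(w)  # width = anchor * exp(w)
    starts = centers - widths / 2
    ends = centers + widths / 2

    keep = p >= threshold
    return [
        (float(s), float(e), int(c), float(q))
        for s, e, c, q in zip(starts[keep], ends[keep], labels[keep], p[keep])
    ]


def iou_1d(a, b):
    """Intersection over union of two time intervals (start, end, ...)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def nms_1d(dets, iou_threshold=0.5):
    """Greedy per-class NMS: keep the highest-scoring interval among overlaps."""
    kept = []
    for d in sorted(dets, key=lambda d: d[3], reverse=True):
        if all(d[2] != k[2] or iou_1d(d, k) < iou_threshold for k in kept):
            kept.append(d)
    return kept


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    raw = rng.normal(size=(3, 11, 3 + 13))   # stand-in for a network output
    anchors = np.array([1.0, 2.0, 4.0])      # illustrative anchor widths in seconds
    for det in nms_1d(decode(raw, anchors)):
        print(det)
```

Because each detection is a single interval with a start and an end, it can be compared against annotated motif boundaries directly, without the post-hoc grouping of active frames that a frame-wise detector would need.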
Abstract
Leitmotifs are musical phrases that are reprised in various forms throughout a piece. Due to diverse variations and instrumentation, detecting the occurrence of leitmotifs from audio recordings is a highly challenging task. Leitmotif detection may be handled as a subcategory of audio event detection, where leitmotif activity is predicted at the frame level. However, as leitmotifs embody distinct, coherent musical structures, a more holistic approach akin to bounding box regression in visual object detection can be helpful. This method captures the entirety of a motif rather than fragmenting it into individual frames, thereby preserving its musical integrity and producing more useful predictions. We present our experimental results on tackling leitmotif detection as a boundary regression task.
