Boundary Regression for Leitmotif Detection in Music Audio

Sihun Lee, Dasaem Jeong

TL;DR

This work reframes leitmotif detection in music as a boundary-regression problem rather than frame-wise event detection, adopting a YOLO-like CNN that outputs motif boundary predictions over a time grid. Audio is transformed with a constant-Q transform and processed in 15-second clips, producing an output tensor of shape $n \times 11 \times (3 + C)$ with $n=3$ and $C=13$. Each prediction includes $p$, $x$, and $w$, and the width is computed as $a_n \exp(w)$ using predetermined anchors $A = \{a_1, \ldots, a_n\}$. Training uses a YOLOv3-style multi-part loss with pitch augmentation; anchors are selected by k-means on motif boundaries, and NMS is applied during evaluation. Results on the Wagner Ring dataset show that the approach provides competitive boundary detections while enabling exact motif boundary localization, though act-split generalization remains challenging. The work highlights the potential for precise, boundary-aware analysis of repetitive motifs in music and invites further exploration of boundary-based audio localization methods.
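
To make the tensor layout concrete, the following is a minimal sketch of how an output of shape $n \times 11 \times (3 + C)$ might be decoded into motif boundaries and filtered with 1-D NMS. Several details are assumptions that the summary does not confirm: $p$ is treated as an objectness logit, $x$ as a sigmoid-squashed center offset within its grid cell (the YOLOv3 convention), each of the $n$ anchor slots uses its own anchor width in seconds, and NMS is applied per class. The function names, thresholds, and anchor values are hypothetical.

```python
import numpy as np

CLIP_SECONDS = 15.0                         # clip length stated in the text
N_ANCHORS, N_CELLS, N_CLASSES = 3, 11, 13   # n, grid size, and C from the text


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def decode(pred, anchors, threshold=0.5):
    """Turn a raw (n, 11, 3 + C) array into (start, end, score, class) tuples.

    Assumed layout of the last axis: [p, x, w, class scores...].
    """
    cell = CLIP_SECONDS / N_CELLS           # seconds covered by one grid cell
    detections = []
    for i in range(N_ANCHORS):
        for j in range(N_CELLS):
            p, x, w = pred[i, j, :3]
            score = float(sigmoid(p))       # assumed: p is an objectness logit
            if score < threshold:
                continue
            center = (j + sigmoid(x)) * cell    # assumed: offset inside cell j
            width = anchors[i] * np.exp(w)      # anchor width scaled by exp(w)
            cls = int(np.argmax(pred[i, j, 3:]))
            detections.append(
                (float(center - width / 2), float(center + width / 2), score, cls)
            )
    return detections


def iou_1d(a, b):
    """Intersection over union of two (start, end, ...) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy per-class NMS: keep the highest-scoring of overlapping boxes."""
    kept = []
    for d in sorted(detections, key=lambda d: d[2], reverse=True):
        if all(d[3] != k[3] or iou_1d(d, k) < iou_threshold for k in kept):
            kept.append(d)
    return kept


# Hypothetical anchors (seconds) and a toy prediction with two overlapping
# detections of the same class in the same grid cell.
anchors = np.array([2.0, 4.0, 8.0])
pred = np.full((N_ANCHORS, N_CELLS, 3 + N_CLASSES), -10.0)
pred[1, 5, :3] = [4.0, 0.0, 0.0]
pred[1, 5, 3 + 7] = 5.0
pred[2, 5, :3] = [2.0, 0.0, -0.6]
pred[2, 5, 3 + 7] = 5.0
print(nms(decode(pred, anchors)))
```

In the toy input, both detections are centered on the same cell and overlap almost entirely, so the greedy pass keeps only the higher-scoring one. A real pipeline would take the anchors from k-means over annotated motif lengths rather than from hand-picked values.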

Abstract

Leitmotifs are musical phrases that are reprised in various forms throughout a piece. Due to diverse variations and instrumentation, detecting the occurrence of leitmotifs from audio recordings is a highly challenging task. Leitmotif detection may be handled as a subcategory of audio event detection, where leitmotif activity is predicted at the frame level. However, as leitmotifs embody distinct, coherent musical structures, a more holistic approach akin to bounding box regression in visual object detection can be helpful. This method captures the entirety of a motif rather than fragmenting it into individual frames, thereby preserving its musical integrity and producing more useful predictions. We present our experimental results on tackling leitmotif detection as a boundary regression task.
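
The fragmentation issue mentioned above can be illustrated with a small sketch. It thresholds hypothetical frame-level activity scores for a single motif and groups consecutive active frames into segments; the scores, the threshold, and the 0.5-second hop are made up for illustration and are not taken from the paper.

```python
import numpy as np


def frames_to_segments(active, hop_seconds):
    """Group consecutive active frames into (start, end) segments in seconds."""
    active = np.asarray(active, dtype=bool)
    # Pad with inactive frames so every run has a detectable start and end.
    padded = np.concatenate(([False], active, [False]))
    change = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = change[0::2], change[1::2]
    return [(float(s * hop_seconds), float(e * hop_seconds))
            for s, e in zip(starts, ends)]


# Hypothetical frame-wise scores for one motif occurrence; the dip at
# frame 3 falls below the threshold and splits the motif in two.
frame_scores = np.array([0.1, 0.8, 0.9, 0.4, 0.7, 0.9, 0.2])
segments = frames_to_segments(frame_scores >= 0.5, hop_seconds=0.5)
print(segments)
```

A single low-scoring frame is enough to break one occurrence into two segments, whereas a boundary-regression output describes the occurrence with one start and one end.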

Paper Structure

This paper contains 7 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Predictions from the proposed (c) and baseline (d) models. The heights of the bounding boxes in (c) reflect confidence scores and do not relate to the frequency dimension.