Musical Phrase Segmentation via Grammatical Induction

Reed Perkins, Dan Ventura

TL;DR

The paper investigates musical phrase segmentation via grammatical induction, evaluating five grammar-learning algorithms on three datasets using multiple viewpoint representations of music. It shows that the offline LONGESTFIRST algorithm achieves the best F1 scores on all three datasets, that input encodings which include the duration viewpoint perform best, and that the learned grammars expose hierarchical structure. The work offers a data-driven approach to segmenting symbolic musical sequences and examines how rhythmic and transpositional invariances affect segmentation quality, with implications for data-driven procedural music generation.

Abstract

We outline a solution to the challenge of musical phrase segmentation that uses grammatical induction algorithms, a class of algorithms which infer a context-free grammar from an input sequence. We analyze the performance of five grammatical induction algorithms on three datasets using various musical viewpoint combinations. Our experiments show that the LONGESTFIRST algorithm achieves the best F1 scores across all three datasets and that input encodings that include the duration viewpoint result in the best performance.
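
To make the idea of inferring a context-free grammar from a single input sequence concrete, here is a minimal sketch of offline digram replacement in the style of Re-Pair. It is not SEQUITUR, LONGESTFIRST, or any of the five algorithms evaluated in the paper; the function names `induce_grammar` and `expand`, the rule names `R1`, `R2`, ..., and the example duration sequence are all illustrative assumptions.

```python
from collections import Counter


def induce_grammar(sequence):
    """Greedy digram replacement (Re-Pair-style sketch, NOT the paper's algorithms).

    Repeatedly replaces the most frequent adjacent pair of symbols with a new
    nonterminal until no pair can be replaced at least twice.
    Returns a dict mapping rule names to right-hand sides; "S" is the start rule.
    Assumes no terminal is a string of the form "R<number>" or "S".
    """
    rules = {}
    seq = list(sequence)
    next_id = 1
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break
        name = f"R{next_id}"
        # Replace non-overlapping occurrences, scanning left to right.
        out, i, replaced = [], 0, 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)
                i += 2
                replaced += 1
            else:
                out.append(seq[i])
                i += 1
        if replaced < 2:
            # The counted occurrences overlapped (e.g. "a a a"); stop here
            # rather than create a rule that is used only once.
            break
        rules[name] = list(pair)
        seq = out
        next_id += 1
    rules["S"] = seq
    return rules


def expand(rules, symbol="S"):
    """Expand a symbol back into the terminal sequence it derives."""
    if symbol not in rules:
        return [symbol]  # terminal symbol
    result = []
    for child in rules[symbol]:
        result.extend(expand(rules, child))
    return result


# Hypothetical duration sequence (in quarter notes), not taken from the datasets.
durations = [1.0, 1.0, 0.5, 0.5, 1.0, 1.0, 0.5, 0.5, 2.0]
grammar = induce_grammar(durations)
recovered = expand(grammar)
```

Each nonterminal in the result derives a repeated subsequence of the input, and nonterminals may refer to other nonterminals, which is where a hierarchy comes from. Expanding the start rule reproduces the original sequence, so the grammar is a lossless re-description of it.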

Paper Structure

This paper contains 20 sections, 9 equations, 8 figures, and 4 algorithms.

Figures (8)

  • Figure 1: The grammar $G$ for Hymn 2 generated by Sequitur using the duration viewpoint. Numbers enclosed in angle brackets are the feature vectors $\phi_i$, each of which contains a single duration value. These feature vectors are also the terminal symbols of $G$.
  • Figure 2: Representation of the note event sequence $\omega^{\mathcal{E}}$ for Hymn 5. (a) shows a visual rendering of the musical sequence and (b) shows the note event representation. Each column in (b) shows a note event $e_i$, which is equivalent to the 3-tuple $(o, p, d)$, where $o$, $p$, and $d$ represent the onset time, the MIDI pitch value, and the duration (in quarter notes). A note event sequence $\omega^{\mathcal{E}}$ consists of note events $e_1, e_2, \ldots, e_n$.
  • Figure 3: Ground truth phrase annotation $P$ for Hymn 2 given by Annotator 1, represented textually (a) and visually (b). Each phrase $p \in P$ was given an identifying label and contains one or more occurrences. Each occurrence $o$ of the phrase $p_i$ is equivalent to the range $[x_1,\ x_2]$ with $x_1, x_2 \in \mathbb{N}$ and $x_1 < x_2$, where $x_1$ and $x_2$ are the starting and ending indices of that particular phrase occurrence. In (b), each distinct notehead represents a note event $e_i$ at index $i$ (0-based). Phrases that contain 2 or more occurrences are known as patterns; in this example, Phrases A and B are considered patterns.
  • Figure 4: Representation of VCI-31, which is equal to { pitch, duration, ioi, pitchC, pitchI }, for Hymn 5. The top (a) and middle (b) panels are repeated from Figure 2. Each column in (c) shows the elements of the feature vector $\phi_i$, where $i$ is indicated by the index. For example, $\phi_0 = \langle 60, 1.0, \bot, \bot, \bot \rangle$ and $\phi_9 = \langle 70, 0.5, 0.5, -1, -2 \rangle$. An algorithm is given an entire sequence of feature vectors $\omega^\Phi = \phi_0, \phi_1, \ldots, \phi_n$ as input.
  • Figure 5: Summary of F1 scores for each algorithm and dataset. Each individual plot shows the average F1 score (y axis) obtained by a given algorithm on a given dataset when using each viewpoint combination (x axis). Each plot is annotated with the mean (denoted as $\mu$ and drawn as the dashed black line) and the variance (denoted as $\sigma$) of all F1 scores across all viewpoints for that particular algorithm and dataset combination. The columns of the figure correspond to the dataset used, and the rows correspond to the algorithm used.
  • ...and 3 more figures
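
The feature vectors described in the Figure 4 caption can be illustrated with a short sketch that turns note events $(o, p, d)$ into vectors in the order { pitch, duration, ioi, pitchC, pitchI }. The viewpoint definitions used here are assumptions inferred from the caption's example values, not definitions quoted from the paper: ioi is taken to be the difference between consecutive onsets, pitchI the signed interval in semitones from the previous note, and pitchC the sign of that interval. `None` stands in for the undefined value $\bot$, and the three note events are hypothetical rather than taken from Hymn 5.

```python
BOTTOM = None  # stands in for the undefined value (the caption's "bottom" symbol)


def to_feature_vectors(events):
    """Map note events (onset, pitch, duration) to 5-element feature vectors.

    Vector layout: (pitch, duration, ioi, pitchC, pitchI).
    The derived viewpoints are ASSUMED definitions (see the text above):
      ioi    = onset - previous onset
      pitchI = pitch - previous pitch (semitones)
      pitchC = sign of pitchI (-1, 0, or 1)
    The first event has no predecessor, so its derived values are BOTTOM.
    """
    vectors = []
    for i, (onset, pitch, duration) in enumerate(events):
        if i == 0:
            ioi = contour = interval = BOTTOM
        else:
            prev_onset, prev_pitch, _ = events[i - 1]
            ioi = onset - prev_onset
            interval = pitch - prev_pitch
            contour = (interval > 0) - (interval < 0)
        vectors.append((pitch, duration, ioi, contour, interval))
    return vectors


# Hypothetical note events: (onset, MIDI pitch, duration in quarter notes).
events = [(0.0, 60, 1.0), (1.0, 72, 0.5), (1.5, 70, 0.5)]
vectors = to_feature_vectors(events)
```

Under these assumptions the first vector has the same shape as the caption's $\phi_0$, with three undefined entries, and a note that falls two semitones half a beat after its predecessor yields the same values as the caption's $\phi_9$. Selecting a subset of the tuple positions would give a smaller viewpoint combination, such as duration alone.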