Aligned Music Notation and Lyrics Transcription

Eliseo Fuentes-Martínez, Antonio Ríos-Vila, Juan C. Martinez-Sevilla, David Rizo, Jorge Calvo-Zaragoza

TL;DR

The AMNLT framework tackles the joint transcription and alignment of vocal music notation and lyrics, formalizing a task that preserves the synchronization between music symbols and lyrics in vocal scores. It examines a spectrum of approaches, from traditional divide-and-conquer pipelines that handle music and lyrics separately to end-to-end models based on direct transcription, unfolding mechanisms, and language modeling. Four Gregorian chant datasets (real and synthetic) are introduced, along with custom metrics that evaluate both transcription accuracy and alignment quality. Experimental results indicate that end-to-end methods generally outperform heuristic baselines on alignment, with language models showing particular promise when sufficient training data is available. The work establishes a first comprehensive framework for digitizing vocal music heritage.

Abstract

The digitization of vocal music scores presents unique challenges that go beyond traditional Optical Music Recognition (OMR) and Optical Character Recognition (OCR), as it necessitates preserving the critical alignment between music notation and lyrics. This alignment is essential for proper interpretation and processing in practical applications. This paper introduces and formalizes, for the first time, the Aligned Music Notation and Lyrics Transcription (AMNLT) challenge, which addresses the complete transcription of vocal scores by jointly considering music symbols, lyrics, and their synchronization. We analyze different approaches to address this challenge, ranging from traditional divide-and-conquer methods that handle music and lyrics separately, to novel end-to-end solutions including direct transcription, unfolding mechanisms, and language modeling. To evaluate these methods, we introduce four datasets of Gregorian chants, comprising both real and synthetic sources, along with custom metrics specifically designed to assess both transcription and alignment accuracy. Our experimental results demonstrate that end-to-end approaches generally outperform heuristic methods in the alignment challenge, with language models showing particular promise in scenarios where sufficient training data is available. This work establishes the first comprehensive framework for AMNLT, providing both theoretical foundations and practical solutions for preserving and digitizing vocal music heritage.

Paper Structure

This paper contains 3 sections.