Note-Level Singing Melody Transcription for Time-Aligned Musical Score Generation

Leekyung Kim, Sungwook Jeon, Wan Heo, Jonghun Park

TL;DR

This paper extends note-level singing melody transcription to directly generate time-aligned musical scores by jointly predicting onset, offset, pitch, and note value from audio using a Transformer-based end-to-end framework. It introduces a dedicated tokenization scheme and a pseudo-labeling approach to overcome scarce note-value annotations, along with novel evaluation metrics for time-aligned note values. Empirical results on ST500 and HSD demonstrate that the proposed T3MS model achieves superior note-level transcription and note-value recognition compared with state-of-the-art baselines, while enabling direct score visualization. The work advances practical automatic score generation from audio, with potential applications in music education, analysis, and retrieval, and points to future work on triplet rhythms and varying time signatures.

Abstract

Automatic music transcription converts audio recordings into symbolic representations, facilitating music analysis, retrieval, and generation. In the audio domain, a musical note is characterized by pitch, onset, and offset, whereas in the musical score domain it is defined by pitch and note value. A time-aligned score, derived from timing information along with pitch and note value, allows a part of the score to be matched with the corresponding part of the audio, enabling various applications. In this paper, we extend the traditional note-level transcription task, which recognizes onset, offset, and pitch, to additionally extract note value so that a time-aligned score can be generated from an audio input. To address this new challenge, we propose an end-to-end framework that integrates recognition of note value, pitch, and temporal information. This approach avoids the error accumulation inherent in multi-stage methods and enhances accuracy through mutual reinforcement. Our framework employs tokenized representations designed for this task that incorporate note value information. Furthermore, we introduce a pseudo-labeling technique to address the scarcity of annotated note value data. This technique produces approximate note value labels from existing datasets for traditional note-level transcription. Experimental results demonstrate that the proposed model outperforms existing state-of-the-art approaches in note-level transcription tasks. We also introduce new evaluation metrics that assess both temporal and note value aspects to demonstrate the robustness of the model. Moreover, qualitative assessments via visualized musical scores confirm the effectiveness of our model in capturing note values.
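
As a rough illustration of the pseudo-labeling idea in the abstract, the sketch below derives an approximate note value from a note's onset and offset plus a list of beat times. The exact rule used in the paper is not given here, so everything in the code is an assumption made for illustration: the beat times are taken as already produced by some beat tracker, the quarter note is assumed to span one beat, and the `ALLOWED_BEATS` vocabulary and the nearest-value quantization are hypothetical.

```python
from bisect import bisect_right
from typing import Sequence

# Hypothetical note-value vocabulary, in beats (assumes quarter note = 1 beat).
ALLOWED_BEATS = [0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 3.0, 4.0]


def local_beat_duration(beats: Sequence[float], t: float) -> float:
    """Length in seconds of the beat interval that contains time t.

    `beats` must be sorted and contain at least two beat times.
    Times outside the tracked range fall back to the nearest interval.
    """
    i = bisect_right(beats, t) - 1
    i = max(0, min(i, len(beats) - 2))  # clamp to a valid interval index
    return beats[i + 1] - beats[i]


def pseudo_label(onset: float, offset: float, beats: Sequence[float]) -> float:
    """Approximate note value in beats (assumed rule: nearest allowed value)."""
    length_in_beats = (offset - onset) / local_beat_duration(beats, onset)
    return min(ALLOWED_BEATS, key=lambda v: abs(v - length_in_beats))


# Usage: beats at 120 BPM (0.5 s apart); a note lasting about one beat.
beats = [0.0, 0.5, 1.0, 1.5, 2.0]
label = pseudo_label(0.52, 0.98, beats)
```

Using the local beat interval rather than a single global tempo is one possible design choice; it lets the label follow tempo drift, at the cost of being sensitive to individual beat-tracking errors.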

Paper Structure

This paper contains 25 sections, 6 equations, 10 figures, 9 tables, 1 algorithm.

Figures (10)

  • Figure 1: An overview of the extended note-level singing melody transcription task proposed in this paper.
  • Figure 2: Proposed framework. (a) Note values are pseudo-labeled using the labels of a note-level transcription dataset and the beat tracking results obtained from an open-source library. An end-to-end model is then trained on the audio together with the dataset labels augmented with the pseudo-labeled note values. (b) At inference, the trained model recognizes onset, offset, pitch, and note value at once. These recognition results enable direct generation of a time-aligned score from an audio input.
  • Figure 3: Distribution of pseudo-labeling outcomes using madmom and Beat-Transformer.
  • Figure 4: Tokenization example. Notes in an audio segment are represented as a token sequence. A note is represented by four tokens: two for time, one for pitch, and one for note value. The start-of-sequence token ($\langle SOS \rangle$) and the end-of-sequence token ($\langle EOS \rangle$) are added to the beginning and the end of the sequence, respectively.
  • Figure 5: Example of segmentation where a note offset exists without its corresponding note onset. The musical score is shown at the top, and the piano-roll representation is shown at the bottom. The numbers in the corresponding piano-roll represent the note values. The note marked in orange is split when the audio is divided into segments of $T$ seconds. The preceding fragment only contains the note onset, while the following fragment only contains the note offset.
  • ...and 5 more figures
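
The tokenization described in the Figure 4 caption (four tokens per note, wrapped in start-of-sequence and end-of-sequence tokens) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the token spellings, the order of the four tokens within a note, the 10 ms time resolution, and the `Note` type are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List

SOS, EOS = "<SOS>", "<EOS>"


@dataclass
class Note:
    onset: float      # seconds from the start of the audio segment
    offset: float     # seconds from the start of the audio segment
    pitch: int        # MIDI note number
    note_value: str   # e.g. "1/4" for a quarter note (hypothetical label format)


def tokenize(notes: List[Note], time_step: float = 0.01) -> List[str]:
    """Turn the notes of one segment into a flat token sequence.

    Each note contributes two time tokens (onset, offset), one pitch
    token, and one note-value token. Times are quantized to `time_step`
    seconds (an assumed resolution).
    """
    tokens = [SOS]
    for n in sorted(notes, key=lambda n: n.onset):
        tokens += [
            f"<time_{round(n.onset / time_step)}>",
            f"<time_{round(n.offset / time_step)}>",
            f"<pitch_{n.pitch}>",
            f"<value_{n.note_value}>",
        ]
    tokens.append(EOS)
    return tokens


# Usage: one quarter note (MIDI pitch 60) from 0.5 s to 1.0 s.
sequence = tokenize([Note(onset=0.5, offset=1.0, pitch=60, note_value="1/4")])
```

A sequence for N notes therefore has 4N + 2 tokens under these assumptions.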