
MR-MT3: Memory Retaining Multi-Track Music Transcription to Mitigate Instrument Leakage

Hao Hao Tan, Kin Wai Cheuk, Taemin Cho, Wei-Hsiang Liao, Yuki Mitsufuji

Abstract

This paper presents enhancements to the MT3 model, a state-of-the-art (SOTA) token-based multi-instrument automatic music transcription (AMT) model. Despite its SOTA performance, MT3 suffers from instrument leakage, where the transcription of a single instrument is fragmented across several instruments. To mitigate this, we propose MR-MT3, which adds a memory retention mechanism, prior token sampling, and token shuffling. These methods are evaluated on the Slakh2100 dataset, demonstrating improved onset F1 scores and reduced instrument leakage. In addition to the conventional multi-instrument transcription F1 score, new metrics such as the instrument leakage ratio and the instrument detection F1 score are introduced for a more comprehensive assessment of transcription quality. The study also explores the issue of domain overfitting by evaluating MT3 on single-instrument monophonic datasets such as ComMU and NSynth. The findings, along with the source code, are shared to facilitate future work aimed at refining token-based multi-instrument AMT models.

Paper Structure (23 sections, 6 equations, 5 figures, 4 tables)


Figures (5)

  • Figure 1: An example of the instrument leakage issue. In the MIDI transcribed by MT3 (left), we often observe musical notes intended for a specific instrument to "leak" across multiple instruments, leading to a cluttered arrangement as compared to the ground truth (middle). We also demonstrate a transcription example which under-predicts the number of instruments (right).
  • Figure 2: Our proposed model architecture for MR-MT3. A memory retention mechanism is introduced to aggregate tokens transcribed from the previous segment (yellow). It is concatenated to the encoder outputs for cross-attention during autoregressive token sampling of the current segment (green).
  • Figure 3: The memory retention mechanism. The aggregated token representation is its self-attention output, truncated at length $L_\text{agg}$.
  • Figure 4: Segmentation workflow to obtain training pairs following MT3. In addition, we propose to use the prior frames (yellow) to inform the transcription of the current frames (green). Prior frames can start from up to $L_\text{max\_hop} \times N_{f}$ before the current frames. In the above example, $L_\text{max\_hop} = 3$.
  • Figure 5: Event token shuffling as a data augmentation method.
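The prior-frame selection described in the Figure 4 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `sample_prior_start`, the choice of a uniformly random hop, and the restriction to whole multiples of $N_f$ are all assumptions made for illustration. The caption only states that prior frames may start up to $L_\text{max\_hop} \times N_{f}$ before the current frames.

```python
import random


def sample_prior_start(current_start, n_f, l_max_hop, rng=random):
    """Pick a start frame for the prior segment (hypothetical helper).

    Assumptions (not from the paper): the hop is an integer number of
    segments of n_f frames each, drawn uniformly from 1..l_max_hop, and
    a prior segment must lie entirely inside the recording.
    """
    # Largest hop that keeps the prior segment at or after frame 0.
    max_hop = min(l_max_hop, current_start // n_f)
    if max_hop < 1:
        return None  # no room for a prior segment before the current one
    hop = rng.randint(1, max_hop)  # inclusive on both ends
    return current_start - hop * n_f
```

With `n_f = 256` and `l_max_hop = 3`, a current segment starting at frame 1024 would be paired with a prior segment starting at frame 768, 512, or 256, while a segment starting at frame 0 has no prior segment.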