Improved Architecture for High-resolution Piano Transcription to Efficiently Capture Acoustic Characteristics of Music Signals

Jinyi Mi; Sehun Kim; Tomoki Toda

TL;DR

Automatic music transcription (AMT), especially for piano, suffers from two problems when trained on high-resolution targets: harmonics are misdetected as notes, and models grow large to reach higher accuracy. The authors replace the STFT-based front-end with a Constant-Q Transform and introduce two lightweight architectures: HRplus (a CRNN with dilated convolutions) and HRplus-hybrid (a CRNN encoder with a non-autoregressive Transformer decoder). Both architectures deliver consistent note-level improvements on MAESTRO v2 and v3 while using far fewer parameters (2.7M and 0.9M vs 20M). The result is accurate, efficient AMT suitable for deployment, with potential for extension to other instruments via transfer learning.

Abstract

Automatic music transcription (AMT), which aims to convert musical signals into musical notation, is an important task in music information retrieval. Recent works have applied high-resolution labels, i.e., the continuous onset and offset times of piano notes, as training targets, achieving substantial improvements in transcription performance. However, some issues remain: the harmonics of notes are sometimes recognized as false-positive notes, and AMT models tend to grow larger in order to improve transcription performance. To address these issues, we propose an improved high-resolution piano transcription model that better captures the specific acoustic characteristics of music signals. First, we employ the Constant-Q Transform as the input representation to better adapt to musical signals. Second, we design two architectures: the first is based on a convolutional recurrent neural network (CRNN) with dilated convolution, and the second is an encoder-decoder architecture that combines a CRNN with a non-autoregressive Transformer decoder. We conduct systematic experiments with our models. Compared to the high-resolution AMT system used as a baseline, our models achieve 1) consistent improvement in note-level metrics and 2) a significantly smaller model size, which sheds light on future work.
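As a rough illustration of why a Constant-Q Transform is a natural fit for musical signals, the sketch below computes CQT center frequencies, which are spaced geometrically so that each bin can line up with a musical pitch, and contrasts them with the linear spacing of STFT bins. The starting frequency (A0 = 27.5 Hz), 12 bins per octave, the sample rate, the FFT size, and all function names are illustrative assumptions, not the configuration used in the paper.

```python
import math

def cqt_center_frequencies(f_min, n_bins, bins_per_octave):
    """Geometric spacing: f_k = f_min * 2 ** (k / bins_per_octave)."""
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]

def cqt_quality_factor(bins_per_octave):
    """Ratio of center frequency to bandwidth, identical for every bin."""
    return 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)

def stft_bin_spacing(sample_rate, n_fft):
    """STFT bins are linearly spaced, sample_rate / n_fft Hz apart."""
    return sample_rate / n_fft

# Assumed, illustrative settings (not taken from the paper).
F_MIN = 27.5            # A0, the lowest piano key
BINS_PER_OCTAVE = 12    # one bin per semitone
N_BINS = 88             # one bin per piano key

freqs = cqt_center_frequencies(F_MIN, N_BINS, BINS_PER_OCTAVE)
q = cqt_quality_factor(BINS_PER_OCTAVE)
spacing = stft_bin_spacing(sample_rate=16000, n_fft=2048)

# Distance in Hz between the two lowest piano keys (A0 and A#0).
lowest_semitone_gap = freqs[1] - freqs[0]

print(f"A4 bin frequency: {freqs[48]:.2f} Hz")
print(f"Constant Q: {q:.2f}")
print(f"STFT bin spacing: {spacing:.2f} Hz vs lowest semitone gap: {lowest_semitone_gap:.2f} Hz")
```

With these assumed settings, the gap between the two lowest piano keys is narrower than one STFT bin, so a linear-frequency front-end cannot separate them, whereas the CQT keeps one bin per semitone across the whole keyboard. The same geometric spacing places the harmonics of a note at fixed bin offsets regardless of pitch.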

Paper Structure (17 sections, 6 equations, 3 figures, 2 tables)

Introduction
Related work
Onset and offset times detection
High-resolution piano transcription system
Proposed method
Improved CRNN Model for the High-Resolution System
Inputs and Outputs
Acoustic Model for CQT Input Representation
Hybrid CRNN-Transformer Encoder-Decoder Model
Loss functions
Experimental Evaluations
Datasets
Experimental Setup
Baselines
Evaluation Metrics
...and 2 more sections

Figures (3)

Figure 1: The high-resolution model.
Figure 2: Illustration of the HRplus model architecture. Norm denotes instance normalization with a ReLU activation, $d$ denotes the dilation rate.
Figure 3: Illustration of the HRplus-hybrid model architecture. FC denotes a fully connected layer, $\sigma$ denotes a sigmoid function.
