Towards Efficient and Real-Time Piano Transcription Using Neural Autoregressive Models

Taegyun Kwon; Dasaem Jeong; Juhan Nam

TL;DR

The paper addresses real-time piano transcription using neural autoregressive models, focusing on lightweight, low-latency deployment. It introduces PAR, a Pitchwise AutoRegressive architecture with a frequency-conditioned FiLM acoustic module, pitchwise LSTM for intra-pitch transitions, and Enhanced Recursive Context to incorporate velocity and duration. Two models are proposed: PAR for high performance and PAR_compact for compact deployment, both achieving competitive note-level accuracy on MAESTRO while maintaining real-time latency. Ablation studies confirm the crucial role of the pitchwise LSTM and autoregressive connections, with cross-dataset tests showing reasonable generalization to unseen piano data. The work advances practical online piano transcription and informs design choices for note length and pitch-range behavior.

Abstract

In recent years, advancements in neural network designs and the availability of large-scale labeled datasets have led to significant improvements in the accuracy of piano transcription models. However, most previous work focused on high-performance offline transcription and gave little deliberate consideration to model size. The goal of this work is to implement real-time inference for piano transcription while ensuring both high performance and a lightweight model. To this end, we propose novel architectures for convolutional recurrent neural networks, redesigning an existing autoregressive piano transcription model. First, we extend the acoustic module by adding a frequency-conditioned FiLM layer to the CNN module to adapt the convolutional filters along the frequency axis. Second, we improve note-state sequence modeling by using a pitchwise LSTM that focuses on note-state transitions within a note. In addition, we augment the autoregressive connection with an enhanced recursive context. Using these components, we propose two types of models: one for high performance and the other for high compactness. Through extensive experiments, we show that the proposed models are comparable to state-of-the-art models in terms of note accuracy on the MAESTRO dataset. We also investigate the effective model size and real-time inference latency by gradually streamlining the architecture. Finally, we conduct cross-data evaluation on unseen piano datasets and an in-depth analysis to elucidate the effect of the proposed components in terms of note length and pitch range.
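The frequency-conditioned FiLM idea mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `frequency_film`, the `(channels, time, freq)` layout, and the use of a single linear map from normalized bin position to the per-channel scale and shift are all assumptions made for illustration. The only part taken from the text is the general FiLM form (a feature-wise scale and shift) with the modulation depending on the frequency position.

```python
import numpy as np

def frequency_film(x, w_gamma, b_gamma, w_beta, b_beta):
    """Apply a FiLM-style scale and shift that depends on the frequency bin.

    x: feature map of shape (channels, time, freq).
    w_gamma, b_gamma, w_beta, b_beta: hypothetical per-channel parameters,
    each of shape (channels,). A real model would learn these (or an MLP).
    """
    channels, _, n_freq = x.shape
    # Conditioning input: normalized position of each frequency bin in [0, 1].
    pos = np.linspace(0.0, 1.0, n_freq)                           # (freq,)
    # Assumed linear map from position to per-channel scale and shift.
    gamma = w_gamma[:, None] * pos[None, :] + b_gamma[:, None]    # (channels, freq)
    beta = w_beta[:, None] * pos[None, :] + b_beta[:, None]       # (channels, freq)
    # Broadcast over time: the same modulation is applied at every frame.
    return gamma[:, None, :] * x + beta[:, None, :]

# Example: 2 channels, 3 frames, 5 frequency bins.
features = np.ones((2, 3, 5))
modulated = frequency_film(
    features,
    w_gamma=np.zeros(2), b_gamma=np.ones(2),   # scale fixed at 1
    w_beta=np.ones(2), b_beta=np.zeros(2),     # shift grows with frequency
)
```

With the scale fixed at 1 and the shift at 0, the layer reduces to the identity, so in this sketch the frequency dependence enters only through the learned parameters.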

Paper Structure (30 sections, 2 equations, 8 figures, 5 tables)

Introduction
Related Work
Acoustic Model
Musical Language Model
Method
Model Overview
Acoustic Module
Frequency-conditioned FiLM
Note-State Sequence Module
Pitch-wise LSTM
Enhanced Recursive Context
Compact Model
Experiments
Datasets
MAESTRO hawthorne2018enabling
...and 15 more sections

Figures (8)

Figure 1: Frame-level note states and their relations across different pitches. (a) A simplified amplitude envelope of two successive notes while the sustain pedal is pressed. (b) Multi-state representation of notes. In our model, each note is in one of the five states. (c) The intra-pitch relation considers note states within a single pitch only. (d) The inter-pitch relation considers note states across all pitches. We constrained our model to have only intra-pitch relations by removing the inter-pitch link in the LSTM.
Figure 2: Overall system diagram. A single prediction step of the system is illustrated with the context window required for that step. The system takes $6$ more frames in the forward direction, which corresponds to 320 ms. The output of the acoustic model is split into 88 segments, corresponding to 88 pitches. Each segment is processed separately and the results are then combined. For simplicity, velocity modules are omitted.
Figure 3: Model diagram. The gray boxes indicate feature maps, and the rounded white boxes indicate operations. The numbers in the Conv2d block refer to ($channels, filter\ width, filter\ height$), respectively. The Timewise FC refers to the application of a fully connected layer at each time step. The PAR$_{Compact}$ model differs from the PAR model only in the middle part: it reduces the number of channels by applying a 1x1 convolution and splits the feature maps into 88 pitches earlier, which avoids fully connected layers with a large number of parameters.
Figure 4: Enhanced Recursive Context. The left piano roll depicts note states at a specific step $t-1$, and the highlighted areas indicate the context being considered. In kwon2020polyphonic, only the last states were considered as context and converted to a 2-dimensional embedding per pitch. In the Enhanced Recursive Context setting, we additionally consider note velocity and note duration, and the contexts are encoded into a 4-dimensional embedding.
Figure 5: Model Parameters vs. Note F1 Score.
...and 3 more figures
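
The captions of Figures 1 and 2 describe splitting the acoustic output into 88 per-pitch segments and keeping only intra-pitch relations in the LSTM. One way to picture this is a single LSTM step applied to every pitch separately, with the pitch axis treated like a batch axis. The sketch below is an illustration under assumptions, not the paper's code: the function name `pitchwise_lstm_step`, the sharing of one weight set across all pitches, and the hidden sizes are hypothetical.

```python
import numpy as np

N_PITCHES = 88  # one row per piano key, as in the Figure 2 caption

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pitchwise_lstm_step(x, h, c, w_x, w_h, b):
    """One LSTM step applied to each pitch independently.

    x: (88, d_in) per-pitch inputs; h, c: (88, d_hid) per-pitch states.
    w_x: (d_in, 4 * d_hid), w_h: (d_hid, 4 * d_hid), b: (4 * d_hid,).
    The weights are shared across pitches (an assumption of this sketch),
    and no term mixes rows, so there is no inter-pitch link.
    """
    gates = x @ w_x + h @ w_h + b             # (88, 4 * d_hid), row-wise
    i, f, g, o = np.split(gates, 4, axis=1)   # input, forget, cell, output
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

# Example with hypothetical sizes: 3 input features, 4 hidden units per pitch.
rng = np.random.default_rng(0)
d_in, d_hid = 3, 4
w_x = rng.normal(size=(d_in, 4 * d_hid))
w_h = rng.normal(size=(d_hid, 4 * d_hid))
b = np.zeros(4 * d_hid)
x = rng.normal(size=(N_PITCHES, d_in))
h0 = np.zeros((N_PITCHES, d_hid))
c0 = np.zeros((N_PITCHES, d_hid))
h1, c1 = pitchwise_lstm_step(x, h0, c0, w_x, w_h, b)
```

Because every operation acts row by row, changing the input of one pitch leaves the states of all other pitches untouched, which is the intra-pitch constraint in its simplest form.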
