Abstract
Recent piano transcription algorithms have reached a high level of note estimation accuracy. Most research concentrates on offline transcription with bidirectional RNNs and treats the online transcription model as a subordinate case. However, we question the need for unlimited future context, because online transcription is closer to how humans listen to music.
We show that an online model can achieve state-of-the-art performance on par with the best offline models by using an autoregressive model and a specialized structure.
We propose a pitchwise RNN that considers only the state transitions of each individual pitch. The pitchwise RNN also benefits from autoregression, which is well suited to modeling note-state changes at the frame level.
In addition, we use a pitch-conditional FiLM layer to flexibly adapt the filters of the acoustic model across frequency.
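In the standard FiLM formulation, features $\mathbf{x}$ are modulated by a condition-dependent scale and shift,
\begin{equation*}
\mathrm{FiLM}(\mathbf{x} \mid \mathbf{c}) = \gamma(\mathbf{c}) \odot \mathbf{x} + \beta(\mathbf{c}),
\end{equation*}
where in our pitch-conditional variant the condition $\mathbf{c}$ is the pitch.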
To overcome the autoregressive model's slow training procedure and exposure bias, we augment the recursive context from the ground truth in musically plausible ways during training, instead of sampling from the model.
We propose two models. The first aims for high performance,
%% (\#N $\simeq$ 30.2M)
and the second is a compact model (\#N $\simeq$ 2.7M) that aims for a small footprint. Surprisingly, the smallest version of the compact model achieves an Onset F1 of 0.85 with only 24K parameters. In addition to the standard evaluation metrics, we present an in-depth analysis of false predictions.
We compare our models with other state-of-the-art models on several datasets, including the MAESTRO dataset.
Through experiments, we show that the proposed modules are effective for overall performance, especially for offset estimation (97.0\% / 87.9\% Onset / Offset F1). The presented results can serve as benchmarks and enhance the understanding of music transcription in general.