Abstract
Neural MIDI-to-audio synthesis is the task of generating realistic audio with appropriate musical expression, given a symbolic note sequence for a specific instrument. The acoustic guitar admits a wide variety of playing techniques, which gives rise to particularly rich musical expression. In this work, we propose an end-to-end neural synthesizer based on a diffusion-based generative model that closes the gap between MIDI input and realistic guitar sound. We exploit the strongly conditioned nature of the MIDI-to-audio synthesis task and propose an effective autoregressive continuation algorithm built on inpainting methods that have emerged for diffusion models. Furthermore, to address the scarcity of paired MIDI-audio datasets for acoustic guitar, we construct a large dataset whose audio is synthesized with virtual instruments and pre-train the model on it in a transfer-learning setting.