Abstract
We propose a method for time-aligning lyrics with Korean folk song audio using a transformer encoder-decoder model that internally leverages incomplete lyric information. We analyze the characteristics of Korean folk song lyrics and identify discrepancies between the lyrics and their corresponding audio recordings. To handle these issues while fully exploiting existing transcriptions, we propose RefWhisper, a variant of OpenAI's Whisper with an additional encoder module and cross-attention layer that allow the model to refer to an incomplete lyric text during transcription. The additional cross-attention layer also enables alignment among the reference text, the predicted transcription, and the audio. Furthermore, we publicly release the transcription results and timestamp data, aligned at the sentence and word levels, for 14,627 Korean folk songs.
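To make the architectural idea concrete, the following is a minimal PyTorch sketch of a decoder layer that adds a second cross-attention block over a reference-lyric encoder alongside the standard audio cross-attention. The module names, dimensions, and layer ordering are illustrative assumptions for exposition, not the released RefWhisper implementation.

```python
import torch
import torch.nn as nn

class DualCrossAttentionDecoderLayer(nn.Module):
    """Toy decoder layer with two cross-attention blocks: one over audio
    encoder states (as in Whisper) and one over a reference-lyric encoder.
    Illustrative sketch only; hyperparameters follow Whisper-tiny sizes."""

    def __init__(self, d_model=384, n_heads=6):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ref_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, x, audio_states, ref_states):
        # Causal self-attention over the partial transcription.
        mask = torch.triu(torch.full((x.size(1), x.size(1)), float("-inf"),
                                     device=x.device), diagonal=1)
        h = self.norms[0](x)
        x = x + self.self_attn(h, h, h, attn_mask=mask)[0]
        # Cross-attention over the audio encoder output (standard Whisper path).
        h = self.norms[1](x)
        x = x + self.audio_attn(h, audio_states, audio_states)[0]
        # Extra cross-attention over the (possibly incomplete) reference lyrics;
        # its attention weights can be read out to relate reference tokens,
        # predicted tokens, and (via the audio path) time.
        h = self.norms[2](x)
        attn_out, ref_weights = self.ref_attn(h, ref_states, ref_states,
                                              need_weights=True)
        x = x + attn_out
        x = x + self.mlp(self.norms[3](x))
        return x, ref_weights

# Example with toy shapes: 1 sample, 100 audio frames, 20 reference tokens, 8 output tokens.
layer = DualCrossAttentionDecoderLayer()
y, w = layer(torch.randn(1, 8, 384), torch.randn(1, 100, 384), torch.randn(1, 20, 384))
print(w.shape)  # (1, 8, 20): output-token-to-reference-token attention map
```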