Toward Universal Singing Voice Understanding Leveraging Foundation Models

Affiliation
MAC
Presenter
Yuya Yamamoto
Subject
Unified singing voice understanding
Site
B8
Time
Poster Session II - 13:30~15:00

Abstract

Automatic singing voice understanding tasks, such as singer identification, singing voice transcription, and singing technique classification, benefit from data-driven approaches built on deep learning. Thanks to their representational power, these approaches work well even under the rich diversity of vocal styles and noisy recordings. In the singing voice domain, however, the limited availability of labeled data remains a significant obstacle to satisfactory performance. In recent years, self-supervised learning (SSL) models pretrained on large amounts of unlabeled data have demonstrated remarkable performance in speech processing and music classification. By fine-tuning these models for target tasks, performance comparable to conventional supervised learning can be achieved even with limited training data. In this presentation, I explore the utility of such SSL models for singing voice tasks.

(APSIPA 2023): To investigate the utility and behaviour of SSL models, I adopt four models (Wav2Vec2.0, WavLM, MERT, and Map-Music2Vec) for three different tasks: singer identification, singing voice transcription, and singing technique classification.

(CMMR 2023 demo): To make the models better suited to singing voices, I investigate adapters, an alternative fine-tuning method that updates only a small set of learnable parameters while keeping the pretrained backbone frozen. I demonstrate a comparison between full fine-tuning and adapter-based tuning.
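To give a feel for why adapter tuning is attractive with limited labeled data, the sketch below contrasts rough trainable-parameter counts for full fine-tuning versus bottleneck adapters. All dimensions here (hidden size, layer count, bottleneck width) are illustrative assumptions in the style of a base-size transformer SSL model, not figures from the presentation.

```python
# Illustrative parameter-count comparison: full fine-tuning vs. adapters.
# All sizes below are assumed, base-transformer-style values, not numbers
# reported in the presentation.
HIDDEN = 768      # assumed transformer hidden size
LAYERS = 12       # assumed number of transformer layers
BOTTLENECK = 64   # assumed adapter bottleneck width

def full_finetune_params() -> int:
    # Rough weight count per layer: four attention projections
    # plus a two-matrix feed-forward block with 4x expansion.
    attn = 4 * HIDDEN * HIDDEN
    ffn = 2 * HIDDEN * (4 * HIDDEN)
    return LAYERS * (attn + ffn)

def adapter_params() -> int:
    # One bottleneck adapter per layer: a down-projection followed
    # by an up-projection; only these weights are trained.
    down = HIDDEN * BOTTLENECK
    up = BOTTLENECK * HIDDEN
    return LAYERS * (down + up)

full = full_finetune_params()
adapter = adapter_params()
print(f"full fine-tuning: ~{full / 1e6:.1f}M trainable parameters")
print(f"adapter tuning:   ~{adapter / 1e6:.1f}M trainable parameters")
print(f"adapter / full:   {adapter / full:.1%}")
```

Under these assumed sizes, the adapter route trains only about 1-2% of the parameters that full fine-tuning updates, which is the core appeal when labeled singing voice data is scarce.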