
Singing Voice Synthesis

Tags: Speech Processing, Creative AI

1. What is singing voice synthesis?

Singing voice synthesis (SVS) is the task of generating a natural singing voice from a given musical score. With the development of deep generative models, research on synthesizing high-quality singing voices has grown rapidly, and as SVS performance improves, the technology is increasingly being applied to the production of real music content.

2. Challenges

There are various challenges in designing a singing synthesis system that can freely generate high-quality, natural-sounding singing voices.

2.1 Dataset

First of all, constructing a dataset is difficult. Since commercial recordings are rarely made public due to issues such as copyright, publicly available singing datasets are limited. It is also difficult to obtain clean singing voices, as most recordings are released mixed with accompaniment. Lastly, singing synthesis requires not only a clean singing voice but also the corresponding sheet-music information, and annotating it is time-consuming and costly.
To address these problems, several lines of research are being pursued: 1) modeling singing effectively with only a small amount of data (LiteSing, Sinsy), 2) building datasets from audio available on the web using techniques such as source separation and automatic transcription (DeepSinger), and 3) creating and releasing singing datasets that are free from copyright issues, such as children's songs (CSD).
In our lab, we collected 200 songs and conduct research with them. First, we purchase accompaniment MIDI files for K-pop songs from a MIDI accompaniment producer, then hire an amateur singer to sing and record each song to the accompaniment. We then manually correct the small timing and pitch differences between the actual singing and the melody MIDI to obtain paired song-and-score data. Using this data, we work on singing voice synthesis modeling while also pursuing transcription and alignment studies to obtain more precise annotations.
Dataset example (audio, midi stereo)
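As an illustration of the manual correction step, the sketch below compares the f0 contour of a recorded vocal against its melody MIDI and flags frames whose sung pitch deviates from the score. The file names, analysis parameters, and 50-cent threshold are illustrative assumptions, not our actual annotation pipeline.

```python
# Sketch of a pitch-deviation check used as an aid for manual correction:
# compare the f0 of the recorded vocal with the melody MIDI and flag
# frames that disagree by more than 50 cents. File names are hypothetical.
import numpy as np
import librosa
import pretty_midi

SR, HOP = 22050, 512

audio, _ = librosa.load("vocal_take.wav", sr=SR)           # hypothetical file
f0, voiced, _ = librosa.pyin(
    audio,
    fmin=librosa.note_to_hz("C2"),
    fmax=librosa.note_to_hz("C6"),
    sr=SR,
    hop_length=HOP,
)

# Build a frame-level reference pitch (Hz) from the melody MIDI track.
midi = pretty_midi.PrettyMIDI("melody.mid")                # hypothetical file
times = librosa.frames_to_time(np.arange(len(f0)), sr=SR, hop_length=HOP)
ref = np.zeros_like(times)
for note in midi.instruments[0].notes:
    ref[(times >= note.start) & (times < note.end)] = librosa.midi_to_hz(note.pitch)

# Deviation in cents on voiced frames; NaNs (unvoiced or rests) compare False.
sung = np.where(voiced, f0, np.nan)
cents = 1200 * np.log2(sung / np.where(ref > 0, ref, np.nan))
suspect = np.abs(cents) > 50
print(f"{int(np.sum(suspect))} frames deviate by more than 50 cents")
```

Frames flagged this way point the annotator to notes whose onset, offset, or pitch in the MIDI needs adjustment.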

2.2 Sound quality

Speech synthesis research has matured to the point where results of adequate quality can be generated reliably. For singing synthesis to be used in real industry, however, studio-quality output is required, so we are exploring methods for generating audio at a 44.1 kHz sampling rate.
Unlike speech, a singing voice 1) has a wide pitch range, 2) contains many notes with long durations, and 3) must be modeled at a high sampling rate. We approach these problems building on recent vocoder research such as HiFi-GAN, NSF, and Parallel WaveGAN. Applying these speech-oriented models as-is causes several problems in high-quality singing modeling (artifacts at high and low pitches, glitches, etc.). We are therefore developing a singing vocoder that combines GAN-based vocoders, for high fidelity, with the pitch robustness that comes from a source-excitation signal.
Singing vocoder audio samples (ongoing research)
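The source-excitation idea can be illustrated with a minimal sketch: a sinusoid is generated directly from the f0 contour, so the pitch of the excitation is correct by construction, and noise drives the unvoiced regions, as in NSF-style vocoders. All constants below are illustrative, not our model's configuration.

```python
# Minimal sketch of an NSF-style source-excitation signal: a sine generated
# directly from the f0 contour (voiced) plus Gaussian noise (unvoiced).
# Sampling rate, hop size, and amplitudes are illustrative.
import numpy as np

SR, HOP = 44100, 512   # high sampling rate targeted for singing

def sine_excitation(f0_frames, amp=0.1, noise_std=0.003):
    """f0_frames: frame-level f0 in Hz, 0 where unvoiced."""
    f0 = np.repeat(f0_frames, HOP).astype(np.float64)  # frame -> sample rate
    voiced = f0 > 0
    # Integrating instantaneous frequency gives a continuous phase, which
    # avoids clicks at frame boundaries when f0 changes.
    phase = 2 * np.pi * np.cumsum(f0 / SR)
    exc = amp * np.sin(phase) * voiced
    exc += noise_std * np.random.randn(len(exc)) * (~voiced)
    return exc.astype(np.float32)

# Example: a 2-second glide from A3 (220 Hz) up to A4 (440 Hz).
f0_contour = np.linspace(220.0, 440.0, 2 * SR // HOP)
excitation = sine_excitation(f0_contour)
```

A neural filter network then shapes this excitation into the final waveform; because the sinusoid already carries the exact pitch, the network does not have to learn periodicity at 44.1 kHz on its own.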

2.3 Expressiveness & Controllability

As singing synthesis research has matured, many systems can now accurately interpret a given score and render it with high quality. However, for synthesis to serve as a tool that helps creators make music, it must not only sing accurately but also reflect and control various styles. Unlike well-defined elements such as lyrics and pitch, style is an abstract concept, and defining, modeling, and controlling it is a challenging problem.
As a first step toward this goal, we designed a multi-singer synthesis model that can reflect the characteristics of different singers. In particular, we introduced a methodology that controls a singer's identity by decomposing it into voice timbre and singing style. Using this model, it is therefore possible to combine the voice of one singer with the singing style of another, as sketched below.
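A toy sketch of this conditioning scheme, assuming a simple lookup-table identity encoder and GRU decoders; module names and dimensions are illustrative, not the published architecture.

```python
# Toy sketch: one singer-identity embedding conditions two separate
# decoders, one for singing style and one for timbre, so the two
# identities can come from different singers at inference time.
import torch
import torch.nn as nn

class MultiSingerSVS(nn.Module):
    def __init__(self, n_singers, d_id=64, d_hid=256, n_mels=80):
        super().__init__()
        self.singer_id = nn.Embedding(n_singers, d_id)  # identity encoder (lookup here)
        self.style_dec = nn.GRU(d_hid + d_id, d_hid, batch_first=True)   # singing style
        self.timbre_dec = nn.GRU(d_hid + d_id, d_hid, batch_first=True)  # voice timbre
        self.to_mel = nn.Linear(d_hid, n_mels)

    def forward(self, score_feat, style_singer, timbre_singer):
        # score_feat: (B, T, d_hid) encoded lyrics + melody
        B, T, _ = score_feat.shape
        s = self.singer_id(style_singer)[:, None, :].expand(B, T, -1)
        t = self.singer_id(timbre_singer)[:, None, :].expand(B, T, -1)
        h, _ = self.style_dec(torch.cat([score_feat, s], dim=-1))
        h, _ = self.timbre_dec(torch.cat([h, t], dim=-1))
        return self.to_mel(h)

model = MultiSingerSVS(n_singers=10)
mel = model(torch.randn(2, 100, 256),
            style_singer=torch.tensor([0, 1]),   # sing in singer 0/1's style...
            timbre_singer=torch.tensor([3, 3]))  # ...with singer 3's voice
```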
In addition, we considered how to control singing-expression elements other than singer identity (breathing, intensity, f0 contour, etc.), and in paper 3 we proposed an SVS system that models and controls, in a self-supervised manner, expression elements that are difficult to annotate.

3. SVS research in MARG

Paper 1 Adversarially trained end-to-end Korean singing voice synthesis system (Interspeech 2019)
Abstract
In this paper, we propose an end-to-end Korean singing voice synthesis system from lyrics and a symbolic melody using the following three novel approaches: 1) phonetic enhancement masking, 2) local conditioning of text and pitch to the super-resolution network, and 3) conditional adversarial training. The proposed system consists of two main modules: a mel-synthesis network that generates a mel-spectrogram from the given input information, and a super-resolution network that upsamples the generated mel-spectrogram into a linear-spectrogram. In the mel-synthesis network, phonetic enhancement masking is applied to generate implicit formant masks solely from the input text, which enables a more accurate phonetic control of singing voice. In addition, we show that two other proposed methods - local conditioning of text and pitch, and conditional adversarial training - are crucial for a realistic generation of the human singing voice in the super-resolution process. Finally, both quantitative and qualitative evaluations are conducted, confirming the validity of all proposed methods.
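To make the phonetic enhancement masking idea concrete, here is a hedged sketch of one way a text-driven mask could be realized: a sigmoid mask predicted solely from the phoneme encoding multiplies the generated mel-spectrogram. The layer sizes and exact placement are assumptions, not the paper's architecture.

```python
# Hedged sketch of phonetic enhancement masking: a [0, 1] mask predicted
# only from the text (phoneme) encoding acts as an implicit formant filter
# on the mel-spectrogram. Dimensions are illustrative.
import torch
import torch.nn as nn

class PhoneticMask(nn.Module):
    def __init__(self, d_text=256, n_mels=80):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv1d(d_text, n_mels, kernel_size=5, padding=2),
            nn.Sigmoid(),  # one mask value per mel bin and frame
        )

    def forward(self, text_enc, mel_pre):
        # text_enc: (B, T, d_text) phoneme encoding upsampled to frame rate
        # mel_pre:  (B, T, n_mels) mel-spectrogram before masking
        mask = self.proj(text_enc.transpose(1, 2)).transpose(1, 2)
        return mel_pre * mask  # emphasize text-driven (formant) structure

masker = PhoneticMask()
mel = masker(torch.randn(2, 100, 256), torch.randn(2, 100, 80))
```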
Paper 2 Disentangling timbre and singing style with multi-singer singing synthesis system (ICASSP 2020)
Abstract
In this study, we define the identity of the singer with two independent concepts -- timbre and singing style -- and propose a multi-singer singing synthesis system that can model them separately. To this end, we extend our single-singer model into a multi-singer model in the following ways: first, we design a singer identity encoder that can adequately reflect the identity of a singer. Second, we use encoded singer identity to condition the two independent decoders that model timbre and singing style, respectively. Through a user study with listening tests, we experimentally verify that the proposed framework is capable of generating a natural singing voice of high quality while independently controlling the timbre and singing style. Also, by changing the singing style while fixing the timbre, we show that our proposed network can produce a more expressive singing voice.
Paper 3 Expressive singing synthesis using local style token and dual-path pitch encoder (submitted to ICASSP 2022)
Abstract
In this paper, we propose a controllable singing voice synthesis system capable of generating expressive singing voice with two novel methodologies. First, a local style token module, which predicts frame-level style tokens from an input pitch and text sequence, is proposed to allow the singing voice system to control musical expression that is often unspecified in sheet music (e.g., breathing and intensity). Second, we propose a dual-path pitch encoder with a choice of two different pitch inputs: MIDI pitch sequence or f0 contour. Because the initial generation of a singing voice is usually executed by taking a MIDI pitch sequence, one can later extract an f0 contour from the generated singing voice and modify the f0 contour to a finer level as desired. Through quantitative and qualitative evaluations, we confirmed that the proposed model can control various musical expressions while not sacrificing the sound quality of the singing voice system.
paper (coming soon)
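To illustrate the dual-path pitch input described in the abstract above, the sketch below first derives a frame-level f0 contour from MIDI notes (the path used for the initial generation) and then hand-edits that contour, adding vibrato, as a user might before re-synthesis. The frame rate and note format are assumptions for the example.

```python
# Sketch of the two pitch-input paths: path 1 converts MIDI notes into a
# frame-level f0 contour; path 2 feeds back an (optionally edited) f0
# contour for finer control. FPS and the note tuples are assumed formats.
import numpy as np

FPS = 100  # frames per second (assumption)

def midi_to_f0(notes, n_frames):
    """notes: list of (midi_pitch, start_sec, end_sec) -> frame-level f0 in Hz."""
    f0 = np.zeros(n_frames)
    for pitch, start, end in notes:
        f0[int(start * FPS):int(end * FPS)] = 440.0 * 2 ** ((pitch - 69) / 12)
    return f0

notes = [(69, 0.0, 0.5), (71, 0.5, 1.0)]        # A4 then B4
f0_midi = midi_to_f0(notes, n_frames=100)       # path 1: MIDI-driven pitch

# Path 2: edit the contour before re-synthesis, e.g., add a gentle
# 5.5 Hz vibrato of about +/-30 cents on the second note.
f0_edit = f0_midi.copy()
t = np.arange(50, 100) / FPS
f0_edit[50:100] *= 2 ** (0.3 * np.sin(2 * np.pi * 5.5 * t) / 12)
```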
SVS in media

Contact

juheon2@snu.ac.kr (Juheon Lee, Ph.D. candidate)