Abstract
The singing voice is the use of the human voice as an element of musical expression. In genres such as vocal music, musicals, and popular music, the human voice serves as an important instrument. The singing voice takes on a variety of expressions depending on how singers produce sound, so it is classified according to vocal technique, especially in the field of vocal education. However, vocal education often relies on the discretion of an instructor, even at the university level. This paper therefore presents a deep neural network system for evaluating human vocalization, together with the Vocalization Dataset for training such models. While previous quantitative evaluation of the singing voice has focused on the precision of pitch and rhythm, this paper proposes another criterion for evaluation: the quality of vocalization. Two advisors and fourteen subjects were recruited to construct the Vocalization Dataset. The dataset contains about 1 hour and 40 minutes of recordings per person and is annotated with several labels. The labels are the average vocal range by gender (male: C3-B4; female: C4-F5), vowel pronunciation (A, E, I, O, U), four vocal types (Good, Hyperventilation, Physical Tension, Excessive Pressure of Vocal Cord), and the microphone used for recording under different conditions (AKG_C414, Apple_iphoneXS, Neumann_KM184, Royer Labs_R-121). The dataset showed 84.17% accuracy in a semantic evaluation by members of the general public and 64.54% accuracy in an evaluation with a deep neural network.