Abstract
Musical interpretation, when expressed in free-form text rather than in more structured and limited musical tags, often reflects the individual characteristics of the annotator, thereby injecting a degree of subjectivity into the resulting music caption dataset. This study explores the impact of such annotator subjectivity in the MusicCaps dataset, a pioneering collection of human-annotated captions describing 10-second music audio clips.
We conduct three analyses to detect and characterize this subjectivity: (i) an examination of the frequency distribution of tag categories (e.g., genre, mood, or instrument) across annotators, (ii) a qualitative assessment of caption embeddings through UMAP visualizations, and (iii) a quantitative analysis in which we train and compare cross-modal retrieval models using author-specified training splits.
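To make the qualitative step concrete, the following is a minimal sketch of how caption embeddings could be projected and colored by annotator; it is not the authors' exact pipeline. It assumes a local CSV export of MusicCaps with "caption" and "author_id" columns (column names are illustrative), and uses sentence-transformers and umap-learn as the embedding and projection libraries, which are our own choices rather than ones specified by the paper.

```python
# Sketch: embed free-form captions, project to 2-D with UMAP,
# and color points by annotator to look for annotator-specific clusters.
import pandas as pd
import umap
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer

# Hypothetical local export of the MusicCaps metadata.
df = pd.read_csv("musiccaps.csv")

# Encode captions into a shared sentence-embedding space (assumed model).
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(df["caption"].tolist(), show_progress_bar=True)

# Reduce to 2-D; clusters aligned with author_id would suggest
# annotator-specific vocabulary or writing style.
coords = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], c=df["author_id"], cmap="tab20", s=4)
plt.title("Caption embeddings (UMAP), colored by annotator")
plt.show()
```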
Our findings underscore the significant annotator subjectivity inherent in the MusicCaps dataset, emphasizing the need to account for it when collecting free-form text annotations of music or when developing machine-learning models on such datasets.