
Music Auto-tagging in Multimedia Content using Robust Music Representation Learned via Domain Adversarial Training

Affiliation: MARG
Presenter: 정해선
Subject: Music Representation Learning, Music Auto-tagging
Site: A13
Time: Poster Session I, 11:10–12:30

Abstract

Music auto-tagging plays a vital role in music discovery and recommendation by assigning relevant tags or labels to music tracks. However, existing models in the field of Music Information Retrieval (MIR) often struggle to maintain high performance when faced with real-world noise, such as environmental noise and speech commonly found in multimedia content like YouTube videos.
In this research, we draw inspiration from previous studies on speech-related tasks and propose a novel approach to improve music auto-tagging performance on noisy sources. Our method brings Domain Adversarial Training (DAT) into the music domain, learning robust music representations that are resilient to noise. Unlike previous speech-based work, which typically pretrains the feature extractor and then runs the DAT phase, our approach adds a pretraining phase for the domain classifier itself. This additional phase first trains the domain classifier to reliably distinguish clean from noisy music sources; during adversarial training, this stronger classifier then pushes the feature extractor toward representations in which clean and noisy music are indistinguishable.
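The adversarial coupling between the feature extractor and the domain classifier is typically implemented with a gradient reversal layer. Below is a minimal PyTorch sketch of that mechanism; the module sizes, tag count, and `lambd` value are illustrative assumptions, not details from the poster.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) the gradient on backward.

    Placed between the feature extractor and the domain classifier, it makes
    the extractor *maximize* the domain loss the classifier minimizes.
    """
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse the gradient flowing back into the feature extractor.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Illustrative stand-ins for the real networks (dimensions are made up).
feature_extractor = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
tag_classifier    = nn.Linear(64, 50)   # e.g. 50 music tags
domain_classifier = nn.Linear(64, 2)    # clean vs. noisy domain

x = torch.randn(8, 128)                  # a mock batch of audio features
z = feature_extractor(x)
tag_logits    = tag_classifier(z)                        # tagging head
domain_logits = domain_classifier(grad_reverse(z, 0.5))  # adversarial head
```

In the approach described above, the domain classifier would first be pretrained on clean/noisy labels without gradient reversal, and the reversal would only be enabled during the adversarial phase.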
Furthermore, we create noisy music training data at varying signal-to-noise ratios (SNRs). Exposing the model to different noise levels promotes better generalization across diverse environmental conditions, enabling robust music auto-tagging in a wide range of real-world scenarios.
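Mixing a clean track with noise at a target SNR amounts to scaling the noise so that the clean-to-noise power ratio equals the desired value. A minimal NumPy sketch (the sine-wave "music" and Gaussian noise are placeholders for real audio):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that 10*log10(P_clean / P_noise) == snr_db, then mix."""
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain g such that p_clean / (g**2 * p_noise) == 10**(snr_db / 10).
    g = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + g * noise

rng = np.random.default_rng(0)
sr = 16000
clean = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s 440 Hz tone
noise = rng.standard_normal(sr)                       # stand-in noise source

# Expose the model to several noise levels, as the abstract describes.
noisy_versions = {snr: mix_at_snr(clean, noise, snr) for snr in (0, 5, 10, 20)}
```

Lower SNR values here mean louder noise relative to the music, so iterating over a range of SNRs yields training examples from lightly to heavily corrupted.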
Our proposed network architecture performs strongly on music auto-tagging even under noise types that were not encountered during training. This shows that the learned representations generalize to unseen noise sources, further enhancing the model's effectiveness in real-world applications.
Through this research, we address the limitations of existing music auto-tagging models and present a novel approach that significantly improves performance in the presence of noise. These findings support more accurate and reliable music classification and organization across music processing applications.