Music production is the process of managing and overseeing the recording, editing, and finishing of music tracks. Intelligent Music Production (IMP) aims to introduce artificial intelligence into this process, automating parts of the workflow while still reflecting the user's preferences. We believe that future studies on this topic will help artists and audio engineers simplify the complexity of traditional music production and give them a more creative workflow.
Our Interests Related to IMP
Music Post-Production
Music production includes various sub-processes, from transforming each audio track with audio effects to mixing and mastering. In modern music production, most recorded tracks are processed digitally, so audio engineers seek digital tools that are convenient yet capable of intricate control. IMP should therefore introduce the following; a minimal sketch of the kind of effects chain involved appears after the list.
1. Straightforward procedures for the overall music production process
(our publication: End-to-end Music Remastering System Using Self-supervised And Adversarial Training)
2. New digital audio effects (DAFX) that previous approaches have not been capable of
(our publication: Reverb Conversion of Mixed Vocal Tracks Using an End-to-end Convolutional Deep Neural Network)
3. Improved availability and accessibility of existing complex DAFX modules
(our publication: Differentiable Artificial Reverberation)
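To make the setting concrete, here is a minimal sketch of the kind of serial effects chain (gain, filtering, distortion) that such systems would automate. The processors and parameter values are illustrative placeholders, not tools from our publications.

```python
# A minimal sketch of a serial DAFX chain, the basic structure IMP systems
# aim to automate. Parameter values here are illustrative, not tuned.
import numpy as np
from scipy.signal import butter, lfilter

SR = 44100  # sample rate in Hz

def gain(x, db):
    """Scale the signal by a gain given in decibels."""
    return x * 10.0 ** (db / 20.0)

def low_cut(x, cutoff_hz):
    """Crude 'EQ': attenuate everything below cutoff with a high-pass filter."""
    b, a = butter(2, cutoff_hz / (SR / 2), btype="highpass")
    return lfilter(b, a, x)

def soft_clip(x, drive=2.0):
    """Simple waveshaping distortion via tanh."""
    return np.tanh(drive * x) / np.tanh(drive)

# Chain the effects: each processor feeds the next, as in a DAW insert chain.
track = np.random.randn(SR)  # stand-in for one second of recorded audio
wet = soft_clip(low_cut(gain(track, -6.0), 80.0))
```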
3D Audio
3D audio, commonly delivered as binaural spatial audio, takes ordinary sounds and processes them so that they are virtually placed anywhere in the three-dimensional space around the listener. It thus allows audio engineers to create immersive audio experiences with multidimensional sound. However, how the human brain's auditory system processes binaural listening, in terms of its functional neuroanatomy, is still not fully understood.
[Figure] Example of an application that reproduces binaural sound from a monaural source.
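As a toy illustration of the signal processing involved, the sketch below places a mono source at a given azimuth using only interaural time and level differences (ITD/ILD). Real binaural renderers instead convolve the source with measured head-related impulse responses (HRIRs); all constants here are rough assumptions.

```python
# A toy sketch of binaural placement using only ITD/ILD cues.
import numpy as np

SR = 44100
SPEED_OF_SOUND = 343.0   # m/s
HEAD_RADIUS = 0.0875     # m, approximate human head radius

def binaural_pan(mono, azimuth_deg):
    """Return (left, right) channels for a source at the given azimuth."""
    az = np.deg2rad(azimuth_deg)
    # Woodworth-style ITD approximation (seconds), converted to samples.
    itd = HEAD_RADIUS / SPEED_OF_SOUND * (abs(az) + abs(np.sin(az)))
    delay = int(round(itd * SR))
    # Crude ILD: attenuate the far ear more as the source moves off-center.
    near, far = 1.0, 10.0 ** (-6.0 * abs(np.sin(az)) / 20.0)
    delayed = np.concatenate([np.zeros(delay), mono])[: len(mono)]
    if azimuth_deg >= 0:  # source on the right: left ear is delayed and quieter
        return far * delayed, near * mono
    return near * mono, far * delayed

left, right = binaural_pan(np.random.randn(SR), azimuth_deg=45.0)
```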
Speech | Music Enhancement
Without a controlled acoustic setup, speech and music signals are degraded during capture and transmission by injected noise, undesired reverberation, distortion, and other artifacts. Speech and music enhancement methods aim to reverse this process: they extract or recover the original signal from the degraded one. While we primarily focus on traditional speech enhancement scenarios, i.e., enhancing speech intelligibility during communication, we also address other "reverse-DAFX" problems, such as de-reverberation of vocal tracks and de-clipping and de-compression of already-processed signals. We have also been developing a large-scale speech and singing voice enhancement network that can pre-process various crowd-sourced datasets to bring them to studio quality.
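Most of these methods share a common skeleton: transform to the time-frequency domain, apply a gain mask, and resynthesize. The sketch below uses a hand-made noise-gating mask purely for illustration; learned systems replace it with a mask predicted by a neural network.

```python
# A minimal sketch of mask-based enhancement: estimate a noise floor,
# build a time-frequency gain mask, and resynthesize.
import numpy as np
from scipy.signal import stft, istft

SR = 16000

def spectral_gate(noisy, noise_frames=10, floor=0.1):
    f, t, X = stft(noisy, fs=SR, nperseg=512)
    mag = np.abs(X)
    # Estimate noise magnitude from the first few (assumed speech-free) frames.
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    # Wiener-like gain: suppress bins dominated by the noise estimate.
    gain = np.maximum(1.0 - (noise_mag / (mag + 1e-8)) ** 2, floor)
    _, enhanced = istft(X * gain, fs=SR, nperseg=512)
    return enhanced[: len(noisy)]

clean_estimate = spectral_gate(np.random.randn(SR))
```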
IMP-Related Publications in MARG
• Reverb Conversion of Mixed Vocal Tracks Using an End-to-end Convolutional Deep Neural Network (ICASSP 2021)
◦ Abstract - Reverb plays a critical role in music production, providing listeners with the spatial realization, timbre, and texture of the music. Yet reproducing the musical reverb of a reference track is challenging even for skilled engineers. In response, we propose an end-to-end system capable of switching the musical reverb factor between two different mixed vocal tracks. This method enables us to apply the reverb of a reference track to the source track on which the effect is desired. Further, our model can perform de-reverberation when the reference track is a dry vocal source. The proposed model is trained with an adversarial objective, which makes it possible to handle high-resolution audio samples. A perceptual evaluation confirmed that the proposed model can convert the reverb factor with a preference rate of 64.8%. To the best of our knowledge, this is the first attempt to apply deep neural networks to converting the musical reverb of vocal tracks.
• End-to-end Music Remastering System Using Self-supervised And Adversarial Training (ICASSP 2022)
◦ Abstract - Mastering is an essential step in music production, but it is also a challenging task that traditionally goes through the hands of experienced audio engineers, who adjust the tone, space, and volume of a song. Remastering follows the same technical process, with the added context of mastering a song for the present day. As these tasks have high entry barriers, we aim to lower them by proposing an end-to-end music remastering system that transforms the mastering style of input audio to that of a target. The system is trained in a self-supervised manner using released pop songs. To encourage the model to generate realistic audio reflecting the reference's mastering style, we apply a pre-trained encoder and a projection discriminator. We validate our results with quantitative metrics and a subjective listening test, and show that the model generates samples whose mastering style is similar to the target's.
• Differentiable Artificial Reverberation (TASLP)
◦ Abstract - Artificial reverberation (AR) models play a central role in various audio applications, so estimating the AR model parameters (ARPs) of a target reverberation is a crucial task. Although a few recent deep-learning-based approaches have shown promising performance, their non-end-to-end training schemes prevent them from fully exploiting the potential of deep neural networks. This motivates us to introduce differentiable artificial reverberation (DAR) models, which allow loss gradients to be back-propagated end-to-end. However, implementing the AR models with their difference equations "as is" in a deep-learning framework severely bottlenecks training speed on a parallel processor such as a GPU, due to their infinite impulse response (IIR) components. We tackle this problem by replacing the IIR filters with finite impulse response (FIR) approximations obtained with the frequency-sampling method (FSM). Using the FSM, we implement three DAR models: differentiable Filtered Velvet Noise (FVN), Advanced Filtered Velvet Noise (AFVN), and Feedback Delay Network (FDN). For each AR model, we train its ARP estimation networks for the analysis-synthesis (RIR-to-ARP) and blind estimation (reverberant-speech-to-ARP) tasks end-to-end with its DAR model counterpart. Experimental results show that the proposed method achieves consistent performance improvements over non-end-to-end approaches in both objective metrics and subjective listening tests.
• Blind Estimation of Audio Processing Graph (ICASSP 2023)
◦ Abstract - Musicians and audio engineers sculpt and transform their sounds by connecting multiple processors, forming an audio processing graph. However, most deep-learning methods overlook this real-world practice and assume fixed graph settings. To bridge this gap, we develop a system that reconstructs the entire graph from a given reference audio. We first generate a realistic graph-reference pair dataset and train a simple blind estimation system composed of a convolutional reference encoder and a transformer-based graph decoder. We apply our model to singing voice effects and drum mixing estimation tasks. Evaluation results show that our method can reconstruct complex signal routings, including multi-band processing and sidechaining.
• Music Mixing Style Transfer: A Contrastive Learning Approach to Disentangle Audio Effects (ICASSP 2023)
◦ Abstract - We propose an end-to-end music mixing style transfer system that converts the mixing style of an input multitrack to that of a reference song. This is achieved with an encoder pre-trained with a contrastive objective to extract only audio-effects-related information from a reference music recording. All our models are trained in a self-supervised manner on an already-processed wet multitrack dataset, with an effective data preprocessing method that alleviates the scarcity of unprocessed dry data. We analyze the proposed encoder's ability to disentangle audio effects and validate its performance on mixing style transfer through both objective and subjective evaluations. The results show that the proposed system not only converts the mixing style of multitrack audio to closely match a reference, but also remains robust for mixture-wise style transfer when combined with a music source separation model.
Contact