Take It or Make It: A Neural Synthesizer for Instrument Timbre Cloning and Text-to-Instrument Generation

Affiliation

MARG

Presenter

김경수

Personal Link

https://scholar.google.co.kr/citations?user=bCMZWFIAAAAJ&hl=en

Time

IV / 17:00~17:15

Subject

Neural Synthesizers

Timbre

Abstract

Timbre plays a crucial role in music. In digital music production, the desired timbre is obtained either by time-consuming selection from extensive virtual instrument libraries or by complex synthesizer parameter tuning. In this work, we introduce TIMI, a neural synthesizer whose timbre can be tuned intuitively. TIMI’s timbre can be conditioned on (i) an audio prompt of about five seconds from any single instrument to be cloned, (ii) a free-form text prompt describing the desired timbre or instrument, or (iii) a fusion of multiple audio and/or text prompts. Structured as a decoder-only transformer, TIMI takes as input a CLAP embedding carrying the timbre information together with MIDI tokens, and generates audio tokens that are then decoded into audio by the decoder of a neural audio codec. Preliminary results demonstrate that TIMI can robustly synthesize musical audio that adheres to the provided MIDI and the CLAP-encoded timbre information.
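
The abstract does not say how multiple audio/text prompts are fused into a single timbre condition. One plausible reading, since audio and text prompts both map into a shared CLAP embedding space, is a weighted average of the unit-normalized prompt embeddings. The sketch below illustrates that idea only; the function names `l2_normalize` and `fuse_prompt_embeddings`, the weighting scheme, and the toy two-dimensional vectors are assumptions made for illustration and are not taken from TIMI.

```python
import math
from typing import List, Optional, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def fuse_prompt_embeddings(
    embeddings: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Hypothetical fusion: weighted mean of unit-normalized embeddings.

    Each embedding stands in for the CLAP encoding of one audio or text
    prompt. This is an assumed scheme, not the method used by TIMI.
    """
    if not embeddings:
        raise ValueError("at least one embedding is required")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    if weights is None:
        weights = [1.0] * len(embeddings)
    if len(weights) != len(embeddings):
        raise ValueError("need exactly one weight per embedding")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    # Normalize first so that no prompt dominates through its magnitude.
    unit = [l2_normalize(e) for e in embeddings]
    mixed = [
        sum(w * e[i] for w, e in zip(weights, unit)) / total
        for i in range(dim)
    ]
    # Re-normalize so the fused vector has unit length like its inputs.
    return l2_normalize(mixed)


# Toy stand-ins for two prompt embeddings (real CLAP vectors are much larger).
audio_prompt = [1.0, 0.0]
text_prompt = [0.0, 1.0]
fused = fuse_prompt_embeddings([audio_prompt, text_prompt], weights=[3.0, 1.0])
```

With the weights above, the fused vector leans toward the audio prompt while still carrying some of the text prompt's direction. In a system like the one described, such a fused vector would take the place of the single CLAP embedding that conditions the transformer alongside the MIDI tokens.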