Abstract
Timbre plays a crucial role in music. In digital music production, the desired timbre is typically obtained through time-consuming selection from extensive virtual instrument libraries or through complex synthesizer parameter tuning. In this work, we introduce TIMI, a neural synthesizer whose timbre can be tuned intuitively. TIMI’s timbre can be conditioned on (i) an audio prompt of about five seconds from any single instrument whose timbre is to be cloned, (ii) a free-form text prompt describing the desired timbre or instrument, or (iii) a fusion of multiple audio and/or text prompts. TIMI is a decoder-only transformer that takes as input MIDI tokens and a CLAP embedding carrying the timbre information, and generates audio tokens that are then decoded into audio by the decoder of a neural audio codec. Preliminary results show that TIMI can robustly synthesize musical audio that follows both the provided MIDI and the CLAP-encoded timbre information.
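The abstract does not state how several prompts are fused into one timbre condition. As an illustration only, the following minimal sketch assumes that each prompt has already been encoded into an embedding vector and that fusion is a weighted average of L2-normalized embeddings, renormalized to unit length. The function names (`fuse_timbre_embeddings`, `_unit`), the weighting scheme, and the toy 3-dimensional vectors are hypothetical and are not taken from TIMI; real CLAP embeddings are much higher-dimensional.

```python
import math
from typing import List, Optional, Sequence


def _unit(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def fuse_timbre_embeddings(
    embeddings: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Blend several prompt embeddings into one unit-length vector.

    Assumption (not from the paper): fusion is a weighted average of
    L2-normalized embeddings, followed by renormalization.
    """
    if not embeddings:
        raise ValueError("need at least one embedding")
    if weights is None:
        weights = [1.0] * len(embeddings)
    if len(weights) != len(embeddings):
        raise ValueError("one weight per embedding is required")
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("weights must sum to a positive value")

    # Normalize first so no prompt dominates purely by magnitude.
    units = [_unit(e) for e in embeddings]
    dim = len(units[0])
    if any(len(u) != dim for u in units):
        raise ValueError("all embeddings must share one dimension")

    # Weighted average, one coordinate at a time.
    fused = [
        sum((w / total) * u[i] for w, u in zip(weights, units))
        for i in range(dim)
    ]
    return _unit(fused)


# Toy stand-ins for an audio-prompt and a text-prompt embedding.
audio_prompt = [0.9, 0.1, 0.0]
text_prompt = [0.2, 0.8, 0.1]
# Lean 70% toward the audio prompt, 30% toward the text prompt.
fused = fuse_timbre_embeddings([audio_prompt, text_prompt], weights=[0.7, 0.3])
print(fused)
```

Under this assumption, changing the weights moves the fused vector between the prompts, and the result has the same form as a single-prompt embedding, so it could be passed to the model through the same conditioning input.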