Abstract
Studio-quality speech enhancement aims to restore degraded speech and singing recordings to studio quality. Previous studies have addressed this task with a conditional diffusion model. However, the stochastic process of such a model reaches its prior distribution only in the infinite-time limit, so any finite-time implementation yields only an approximate solution. Furthermore, the approach requires suitable hyperparameters both to define the stochastic differential equation (SDE) and to train the model. To overcome these limitations, we propose a more versatile framework based on stochastic interpolants.
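The infinite-time limitation can be illustrated with a small scalar sketch. It assumes an Ornstein-Uhlenbeck-type forward SDE that drifts from the clean signal toward the degraded one, dx = gamma (y - x) dt + g(t) dw. This is a common choice in diffusion-based speech enhancement, but the exact SDE of the previous work is not specified here. The mean of that process only approaches y as t grows and never equals it at finite t. By contrast, a stochastic interpolant is constructed to hit both endpoints exactly at t = 0 and t = 1. The linear interpolant with noise scale sigma * sqrt(t(1 - t)) used below is a hypothetical example, not necessarily the interpolant used in this work, and the names `ou_mean`, `interpolant`, `gamma`, and `sigma` are illustrative.

```python
import math


def ou_mean(x0: float, y: float, t: float, gamma: float = 1.5) -> float:
    """Mean at time t of the assumed OU-type SDE dx = gamma*(y - x) dt + g(t) dw.

    Starting from the clean value x0, the mean decays toward the degraded value y
    as exp(-gamma*t), so it equals y only in the limit t -> infinity.
    """
    decay = math.exp(-gamma * t)
    return decay * x0 + (1.0 - decay) * y


def interpolant(x0: float, x1: float, t: float, z: float, sigma: float = 0.5) -> float:
    """Hypothetical linear stochastic interpolant on t in [0, 1].

    x_t = (1 - t) * x0 + t * x1 + sigma * sqrt(t * (1 - t)) * z.
    The noise term vanishes at both ends, so x_0 = x0 and x_1 = x1 exactly.
    """
    return (1.0 - t) * x0 + t * x1 + sigma * math.sqrt(t * (1.0 - t)) * z


if __name__ == "__main__":
    clean, degraded = 1.0, 0.2
    # The SDE mean still misses the degraded value at every finite horizon T.
    for T in (1.0, 3.0, 10.0):
        gap = abs(ou_mean(clean, degraded, T) - degraded)
        print(f"SDE mean gap at T={T}: {gap:.2e}")
    # The interpolant reaches the degraded value exactly at t = 1.
    print("interpolant at t=1:", interpolant(clean, degraded, 1.0, z=0.7))
```

The gap printed for the SDE shrinks like exp(-gamma * T) but stays nonzero, which is why a finite-time diffusion sampler starts from an approximate prior. The interpolant avoids this by design, at the cost of having to choose its coefficient functions.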