Towards Controllable & Interactive Generative AI for Voice

Affiliation: Supertone
Presenter: Hyeong-Seok Choi
Time: I / 09:10~09:30
Subject: Voice synthesis

Abstract

Recent advances in generative models are reshaping media production. Creators can now type a few words to obtain a clear image of their ideas, or use automated services for previously labor-intensive work such as color grading and audio mastering, cutting months of work down to minutes. In speech synthesis, end-to-end approaches have reached near-human quality by addressing the training-inference mismatch of the earlier two-stage framework, in which an acoustic model and a vocoder were trained separately. For generative models to be actively adopted in industry-grade media production, such as music, film, and games, however, it is essential to provide creators with models that more accurately reflect their artistic intent. In this talk, we share our design principles for constructing a unified voice synthesis framework. We expect the proposed framework to give creators the controllability to pinpoint specific styles of expression, as well as the modularity to handle multiple layers of input. Finally, short demos of our approach will show how the proposed analysis and synthesis framework can be extended to various applications in the media industry.
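As a purely hypothetical illustration of the modularity and controllability described above, the sketch below models a synthesizer whose control layers (linguistic content, pitch, timbre, style) are supplied independently. All names (`VoiceControls`, `synthesize`, the field names) are invented for this sketch and do not correspond to any actual Supertone API.

```python
from dataclasses import dataclass

# Hypothetical control layers; names and fields are illustrative only,
# not an actual Supertone interface.
@dataclass
class VoiceControls:
    text: str                    # linguistic content layer
    pitch_shift: float = 0.0     # fine-grained pitch control, in semitones
    speaker_id: str = "default"  # timbre / voice-identity layer
    style: str = "neutral"       # high-level expressive-style layer

def synthesize(controls: VoiceControls) -> dict:
    """Toy stand-in for a unified synthesis call: each control layer is
    consumed separately, so one layer (e.g. pitch) can be edited without
    touching the others."""
    return {
        "text": controls.text,
        "pitch_shift": controls.pitch_shift,
        "speaker": controls.speaker_id,
        "style": controls.style,
    }

# Pinpointing a specific expression by overriding only two layers.
result = synthesize(VoiceControls(text="hello", pitch_shift=2.0, style="joyful"))
```

The point of the structure is that each input layer has its own handle, mirroring the talk's goal of letting creators pinpoint a specific style of expression while leaving the remaining layers untouched.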