Modeling Auditory Perception and Speech Development

태그

Speech Processing

2 more properties

1) Visually Grounded Speech Representation Learning

•

Objective: Self-supervised speech representation learning without text labels through audio-image pair data

Idea from language learning of human

•

Dataset: SpokenCOCO, PlacesAudio 

•

Baseline model: ResDAVEnet-VQ

•

Evaluation: Audio-Image retrieval(Recall @ k), zero-speech challenge metrics

◦

The numbers in below example mean similarity score

Gallery view

•

Topics

◦

Architecture for hierarchical discrete unit discovery

◦

Loss between audio feature sequences and image feature map

◦

Language model using both word-level units and phonetic-level units

2) Speech Generation without Reconstruction Loss

•

Objective: Modeling speech development of human using physical vocal tract model

•

Dataset: LibriSpeech, LJ Speech, Google Speech Commands

•

Physical vocal tract model: VocalTractLab, gnuspeech

•

Evaluation: WER/PER using pretrained ASR model

•

Topics

◦

Articulatory parameters estimation from speech

◦

Voice conversion by controlling stationary articulatory parameters

◦

Concept-to-speech