1) Visually Grounded Speech Representation Learning
• Objective: Self-supervised speech representation learning from audio-image pair data, without text labels
◦ Inspired by how humans learn language
• Datasets: SpokenCOCO, PlacesAudio
• Baseline model: ResDAVEnet-VQ
• Evaluation: Audio-image retrieval (Recall@k), ZeroSpeech challenge metrics
◦ The numbers in the example below are similarity scores
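To make the retrieval metric concrete, here is a minimal sketch (not the evaluation code of any cited work) of Recall@k over a similarity matrix whose entry (i, j) scores audio query i against image j, with matched pairs on the diagonal:

```python
import numpy as np

def recall_at_k(sim, k):
    """Recall@k for a square similarity matrix sim, where sim[i, j] is the
    similarity between audio query i and image j, and the ground-truth
    match for query i is image i (the diagonal)."""
    # Rank images for each audio query by descending similarity.
    ranking = np.argsort(-sim, axis=1)
    # A query counts as a hit if its matching image is in the top k.
    hits = (ranking[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return hits.mean()

# Toy 3x3 similarity scores (rows: audio queries, cols: images).
sim = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.4, 0.8],
                [0.1, 0.7, 0.6]])
```

For this toy matrix only query 0 has its match ranked first, so Recall@1 is 1/3, while Recall@2 is 1.0. The same computation with images as queries gives the image-to-audio direction.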
• Topics
◦ Architecture for hierarchical discrete unit discovery
◦ Loss between audio feature sequences and image feature maps
◦ Language model using both word-level and phonetic-level units
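One common way to score an audio sequence against an image feature map, in the spirit of DAVEnet-style "matchmaps", is a frame-by-cell dot product pooled to a scalar and trained with a margin loss. The sketch below illustrates the idea; the pooling choice and margin value are assumptions, not the exact recipe of ResDAVEnet-VQ:

```python
import numpy as np

def matchmap_similarity(audio, image):
    """Score an audio feature sequence (T, D) against an image feature
    map (H, W, D) via a matchmap of per-frame, per-cell dot products,
    max-pooled over space and averaged over time (one common choice)."""
    matchmap = np.einsum('td,hwd->thw', audio, image)  # (T, H, W)
    return matchmap.max(axis=(1, 2)).mean()

def triplet_margin_loss(anchor_audio, pos_image, neg_image, margin=1.0):
    """Hinge loss pushing the matched audio-image pair to score at least
    `margin` above a mismatched pair."""
    pos = matchmap_similarity(anchor_audio, pos_image)
    neg = matchmap_similarity(anchor_audio, neg_image)
    return max(0.0, margin - pos + neg)
```

In practice negatives are sampled within the training batch and the loss is applied symmetrically (audio retrieving images and images retrieving audio).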
2) Speech Generation without Reconstruction Loss
• Objective: Modeling human speech development using a physical vocal tract model
• Datasets: LibriSpeech, LJ Speech, Google Speech Commands
• Physical vocal tract models: VocalTractLab, gnuspeech
• Evaluation: WER/PER measured with a pretrained ASR model
• Topics
◦ Articulatory parameter estimation from speech
◦ Voice conversion by controlling stationary articulatory parameters
◦ Concept-to-speech
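The WER/PER evaluation above reduces to a token-level edit distance between the pretrained ASR model's transcript and the reference. A minimal sketch (any ASR frontend that emits word or phone sequences is assumed):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / len(reference),
    via Levenshtein distance over tokens. Pass word lists for WER or
    phone lists for PER."""
    ref, hyp = reference, hypothesis
    # dp[i][j]: edit distance between ref[:i] and hyp[:j].
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, reference "the cat sat" against hypothesis "the cat sit on" has one substitution and one insertion, giving WER 2/3.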