
Audio-Language Multi-modal Learning


Topic 1. Text Query based Audio Retrieval

Retrieving audio signals from textual descriptions of their sound content (i.e., audio captions).
Text queries consist of manually written audio captions.
For each text query, the goal is to retrieve audio files from a given dataset and sort them based on how well they match the query.
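
The retrieve-and-sort step can be illustrated with a minimal sketch. It assumes that the text query and every audio file have already been mapped into a shared embedding space by some encoder (not shown here), and that the match is scored with cosine similarity. Both are assumptions made for illustration, and the file names, the vectors, and the helper names `cosine_similarity` and `rank_audio_files` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_audio_files(query_embedding, audio_embeddings):
    """Return (file name, score) pairs sorted from best to worst match."""
    scored = [
        (name, cosine_similarity(query_embedding, embedding))
        for name, embedding in audio_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; a real system would obtain these from an audio encoder.
audio_embeddings = {
    "dog_bark.wav": [0.9, 0.1, 0.0],
    "rain.wav": [0.0, 0.2, 0.9],
    "car_engine.wav": [0.1, 0.9, 0.2],
}

# Hypothetical embedding standing in for an encoded caption such as "a dog barks".
query_embedding = [0.8, 0.2, 0.1]

ranking = rank_audio_files(query_embedding, audio_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these toy vectors the ranking places `dog_bark.wav` first, then `car_engine.wav`, then `rain.wav`. In practice the scoring function and the encoders are the parts a retrieval system has to learn; the sorting step itself stays this simple.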

Topic 2. Automated Audio Captioning

The task of describing general audio content in free text.
An inter-modal translation task (not speech-to-text), in which a system accepts an audio signal as input and outputs a textual description (i.e., the caption) of that signal.
Captioning involves modeling concepts (e.g., "muffled sound"), physical properties of objects and environments (e.g., "the sound of a big car", "people talking in a small and empty room"), and high-level knowledge (e.g., "a clock rings three times").

Research 1. Audio-Text Data Augmentation

Research 2. Audio Captioning

Research 3. Audio-Text Retrieval