Abstract
We present PromptCLAP, a prompt tuning framework for the state-of-the-art audio-language model CLAP. By evaluating CLAP on a set of manually designed prompts, we observe that it is sensitive to the input text and that its default prompt for zero-shot classification is suboptimal. To address this issue, PromptCLAP adopts the Context Optimization (CoOp) framework, originally proposed for vision-language models. With learnable prompt embeddings, PromptCLAP enables few-shot audio classification while tuning only a small number of parameters (∼12k). We show that PromptCLAP achieves significant improvements over zero-shot CLAP and exhibits a degree of robustness to domain shift. These results highlight the potential of prompt tuning as an efficient method for adapting audio-language models. We expect the PromptCLAP framework to extend to other audio-language tasks, such as text-audio retrieval and text-to-audio generation.
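To make the CoOp-style idea concrete, below is a minimal PyTorch sketch of prompt tuning as described above: a small set of learnable context embeddings is prepended to frozen class-name token embeddings, and only those context vectors are optimized. The names `text_encoder`, `name_embeds`, and `audio_embeds` are hypothetical stand-ins for CLAP's frozen components, not the paper's actual code; the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptLearner(nn.Module):
    """CoOp-style learnable context prepended to frozen class-name embeddings."""
    def __init__(self, n_ctx: int = 24, embed_dim: int = 512):
        super().__init__()
        # The only trainable parameters: n_ctx * embed_dim (24 * 512 = 12,288,
        # on the order of the ~12k figure quoted in the abstract under these
        # assumed dimensions).
        self.ctx = nn.Parameter(0.02 * torch.randn(n_ctx, embed_dim))

    def forward(self, name_embeds: torch.Tensor) -> torch.Tensor:
        # name_embeds: (n_classes, n_name_tokens, embed_dim), precomputed and frozen.
        ctx = self.ctx.unsqueeze(0).expand(name_embeds.size(0), -1, -1)
        # Result: one prompted token sequence per class.
        return torch.cat([ctx, name_embeds], dim=1)

def train_step(prompt_learner, text_encoder, audio_embeds, name_embeds, labels, opt):
    # Encode prompted class descriptions; gradients flow only into the context.
    text_embeds = text_encoder(prompt_learner(name_embeds))  # (n_classes, dim)
    # Cosine-similarity logits between audio and text embeddings.
    logits = 100.0 * F.normalize(audio_embeds, dim=-1) @ F.normalize(text_embeds, dim=-1).T
    loss = F.cross_entropy(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Toy usage with a stand-in frozen "text encoder" (mean-pool + frozen projection).
if __name__ == "__main__":
    dim, n_classes = 512, 10
    proj = nn.Linear(dim, dim)
    for p in proj.parameters():
        p.requires_grad_(False)                  # pretrained weights stay frozen
    text_encoder = lambda prompts: proj(prompts.mean(dim=1))
    learner = PromptLearner(embed_dim=dim)
    opt = torch.optim.Adam(learner.parameters(), lr=1e-3)
    name_embeds = torch.randn(n_classes, 3, dim)  # frozen class-name token embeddings
    audio_embeds = torch.randn(16, dim)           # a few-shot batch from the audio encoder
    labels = torch.randint(0, n_classes, (16,))
    print(train_step(learner, text_encoder, audio_embeds, name_embeds, labels, opt))
```

Because only `learner.ctx` is passed to the optimizer, the audio and text encoders remain untouched, which is what keeps the tuned parameter count in the tens of thousands rather than the millions.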