PALM: Few-Shot Prompt Learning for Audio Language Models

Asif Hanif, Maha Tufail Agro, Mohammad Areeb Qazi, Hanan Aldarmaki

TL;DR

This work proposes a novel method, Prompt Learning in Audio Language Models (PALM), which optimizes the feature space of the text encoder branch, and demonstrates the effectiveness of this approach on 11 audio recognition datasets, encompassing a variety of speech-processing tasks.

Abstract

Audio-Language Models (ALMs), inspired by advancements in Vision-Language Models (VLMs), have recently achieved remarkable success in zero-shot audio recognition, in which features of audio waveforms are matched with class-specific text prompt features. Given the sensitivity of zero-shot performance to the choice of hand-crafted text prompts, many prompt learning techniques have been developed for VLMs. We explore the efficacy of these approaches in ALMs and propose a novel method, Prompt Learning in Audio Language Models (PALM), which optimizes the feature space of the text encoder branch. Unlike existing methods that work in the input space, our approach results in greater training efficiency. We demonstrate the effectiveness of our approach on 11 audio recognition datasets, encompassing a variety of speech-processing tasks, and compare the results with three baselines in a few-shot learning setup. Our method is either on par with or outperforms other approaches while being computationally less demanding. Code is available at https://asif-hanif.github.io/palm/

Paper Structure

This paper contains 17 sections, 6 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Comparison of our proposed approach, PALM, with three baselines: ZERO-SHOT [deshmukh2023pengi], COOP [zhou2022coop], and COCOOP [zhou2022conditional]. Bar plots show classification accuracy averaged across 11 audio datasets encompassing various speech-processing tasks.
  • Figure 2: Impact of Hand-crafted Prompts on ZERO-SHOT Performance. Zero-shot accuracy across four audio recognition datasets (ESC50 [piczak2015dataset], GT-Music-Genre [sturm2012analysis], SESA [spadini2019sound], and VocalSound [gong_psla]) is evaluated with eight different text prompts using the PENGI model [deshmukh2023pengi]. The accuracy varies with changes in the hand-crafted prompts.
  • Figure 3: Overview of Zero-Shot, COOP, and PALM. (a) Zero-Shot inference involves matching the embedding of the audio waveform with the embeddings of text prompts for each class. The class with the highest matching score is then assigned to the audio. (b) COOP [zhou2022coop] avoids using hand-crafted text prompts by learning the context of text prompts in the token embedding space. It optimizes the input space of the text encoder to enhance classification performance. (c) PALM requires only class names at the input of the text encoder, and it optimizes the feature space by adding learnable context embeddings to text feature vectors. PALM not only outperforms COOP (see Table \ref{tab:main_results}), but it is also more computationally efficient since, unlike COOP, it does not require gradients to flow through the text encoder.
  • Figure 4: Comparison of $\mathrm{PALM}^{\dagger}$ and $\mathrm{PALM}$. Here, $\mathrm{PALM}^{\dagger}$ refers to the setting in which the learnable context embeddings (see Figure \ref{fig:main_diagram} for reference) have been removed from the feature space of the text encoder. The removal of context embeddings drastically degrades performance, highlighting their importance.
  • Figure 5: A higher number of shots generally leads to increased audio classification accuracy using PALM.
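
The Figure 3 caption describes zero-shot matching and feature-space prompting in words; the toy sketch below illustrates both ideas. It is not the authors' implementation. The 2-D feature values, the function names (`cosine`, `predict`, `add_context`), and the use of plain element-wise addition with cosine similarity as the matching score are all assumptions made for illustration. The text features stand in for frozen text-encoder outputs, so nothing in the sketch would need gradients to pass through the encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict(audio_feature, text_features):
    """Return the index of the class whose text feature best matches the audio."""
    scores = [cosine(audio_feature, t) for t in text_features]
    return max(range(len(scores)), key=scores.__getitem__)


def add_context(text_features, context):
    """Feature-space prompting (assumed form): add one context vector per
    class to that class's frozen text feature. Returns new lists and leaves
    the frozen features untouched."""
    return [[t + c for t, c in zip(feat, ctx)]
            for feat, ctx in zip(text_features, context)]


# Made-up 2-D features standing in for frozen encoder outputs.
frozen_text_features = [[1.0, 0.0], [0.8, 0.6]]  # class 0, class 1
audio_feature = [0.95, 0.3]                      # suppose its true class is 1

# Hypothetical context values; in few-shot training these would be the
# learned parameters, while the text encoder itself stays frozen.
learned_context = [[0.0, 0.0], [0.1, -0.3]]

zero_shot_class = predict(audio_feature, frozen_text_features)
adapted_class = predict(
    audio_feature, add_context(frozen_text_features, learned_context)
)
print(zero_shot_class, adapted_class)
```

In this toy setup the zero-shot match picks class 0, while the shifted text feature for class 1 aligns more closely with the audio feature and changes the prediction to class 1. Because the context is applied after the text encoder, the encoder outputs can be computed once and reused.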