Domain Adaptation for Contrastive Audio-Language Models

Soham Deshmukh, Rita Singh, Bhiksha Raj

TL;DR

The paper addresses the sensitivity of contrastive Audio-Language Models (ALMs) to domain-specific prompts by introducing a test-time domain adaptation method. The method learns a domain vector by enforcing prediction consistency across multiple augmented views of the test audio. It decomposes into three stages (Augment, Combine, and Optimize) and uses an entropy-based self-supervised loss to update a domain embedding without labeled data. Empirically, it achieves an average zero-shot improvement of about $3.2\%$ with one unlabeled example (up to $8.4\%$ on some tasks) and around $4.7\%$ with five examples across 12 tasks, while largely preserving the model's generalization capabilities. The method incurs additional compute from augmentation and backpropagation, but it offers a practical on-device adaptation pathway and shows cross-domain performance gains.

Abstract

Audio-Language Models (ALM) aim to be general-purpose audio models by providing zero-shot capabilities at test time. The zero-shot performance of ALM improves by using suitable text prompts for each domain. The text prompts are usually hand-crafted through an ad-hoc process and lead to a drop in ALM generalization and out-of-distribution performance. Existing approaches to improve domain performance, like few-shot learning or fine-tuning, require access to annotated data and iterations of training. Therefore, we propose a test-time domain adaptation method for ALMs that does not require access to annotations. Our method learns a domain vector by enforcing consistency across augmented views of the testing audio. We extensively evaluate our approach on 12 downstream tasks across domains. With just one example, our domain adaptation method leads to 3.2% (max 8.4%) average zero-shot performance improvement. After adaptation, the model still retains the generalization property of ALMs.
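
The Augment, Combine, and Optimize stages summarized in the TL;DR can be illustrated with a toy sketch. This is not the authors' implementation, and several parts are assumptions made only for illustration: augmented views are simulated by adding Gaussian noise to a single audio embedding, the domain vector is added to every class text embedding before normalization, views are combined by averaging their class probabilities, and a finite-difference gradient with step-size halving stands in for backpropagation. All function names are hypothetical.

```python
import math
import random


def normalize(v):
    # Scale a vector to unit length (left unchanged if it is all zeros).
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def augment(audio_emb, n_views, noise, rng):
    # Augment: stand-in for real audio augmentations (assumption):
    # each view is the embedding plus small Gaussian noise.
    return [[x + rng.gauss(0.0, noise) for x in audio_emb] for _ in range(n_views)]


def combined_probs(views, text_embs, domain_vec, scale=10.0):
    # Combine: average the per-view class probabilities.
    # Assumption: the domain vector is added to each text embedding.
    texts = [normalize([t + d for t, d in zip(emb, domain_vec)]) for emb in text_embs]
    avg = [0.0] * len(texts)
    for view in views:
        a = normalize(view)
        probs = softmax([scale * sum(x * y for x, y in zip(a, t)) for t in texts])
        avg = [p + q / len(views) for p, q in zip(avg, probs)]
    return avg


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def adapt(views, text_embs, steps=20, lr=0.1, eps=1e-4):
    # Optimize: lower the entropy of the combined prediction by updating
    # only the domain vector; no labels are used anywhere.
    dim = len(text_embs[0])
    d = [0.0] * dim
    loss = entropy(combined_probs(views, text_embs, d))
    for _ in range(steps):
        grad = []
        for i in range(dim):
            up, down = d[:], d[:]
            up[i] += eps
            down[i] -= eps
            diff = (entropy(combined_probs(views, text_embs, up))
                    - entropy(combined_probs(views, text_embs, down)))
            grad.append(diff / (2 * eps))
        cand = [x - lr * g for x, g in zip(d, grad)]
        cand_loss = entropy(combined_probs(views, text_embs, cand))
        if cand_loss < loss:
            d, loss = cand, cand_loss  # accept only steps that lower entropy
        else:
            lr *= 0.5  # otherwise shrink the step size and retry
    return d, loss


# Toy data: three class text embeddings and one unlabeled audio embedding.
rng = random.Random(0)
text_embs = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.2], [0.3, 0.3, 1.0]]
audio_emb = [0.6, 0.5, 0.3]
views = augment(audio_emb, n_views=8, noise=0.05, rng=rng)
before = entropy(combined_probs(views, text_embs, [0.0] * 3))
domain_vec, after = adapt(views, text_embs)
```

Because a step is accepted only when it lowers the loss, `after` is never larger than `before` in this sketch. Lower entropy means a more confident combined prediction; it does not by itself guarantee a correct one.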

Paper Structure (17 sections, 4 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Our method takes a single input audio at test time and optimizes a domain embedding using an entropy-based loss function. The method does not require labels and enforces consistent prediction across augmented views.
  • Figure 2: Cross-domain adaptation performance. The x-axis indicates the dataset used for domain adaptation. The upper and lower bars indicate the change in in-domain performance and in average zero-shot score, respectively.