Domain Adaptation for Contrastive Audio-Language Models
Soham Deshmukh, Rita Singh, Bhiksha Raj
TL;DR
Contrastive Audio-Language Models (ALMs) are sensitive to the domain-specific text prompts used at test time. This paper introduces a test-time domain adaptation method that learns a domain vector by enforcing prediction consistency across multiple augmented views of the test audio. The approach has three stages (Augment, Combine, and Optimize) and uses an entropy-based self-supervised loss to update the domain embedding without labeled data. Across 12 tasks, it improves average zero-shot performance by about $3.2\%$ with one unlabeled example (up to $8.4\%$ on some tasks) and by around $4.7\%$ with five examples, while largely preserving the model's generalization capabilities. The augmentations and backpropagation add compute at test time, but the method offers a practical path to on-device adaptation, and its gains hold across domains.
Abstract
Audio-Language Models (ALMs) aim to be general-purpose audio models by providing zero-shot capabilities at test time. The zero-shot performance of ALMs improves when suitable text prompts are used for each domain. These text prompts are usually hand-crafted through an ad-hoc process and lead to a drop in ALM generalization and out-of-distribution performance. Existing approaches to improving domain performance, such as few-shot learning or fine-tuning, require access to annotated data and iterations of training. Therefore, we propose a test-time domain adaptation method for ALMs that does not require access to annotations. Our method learns a domain vector by enforcing consistency across augmented views of the testing audio. We extensively evaluate our approach on 12 downstream tasks across domains. With just one example, our domain adaptation method leads to a 3.2% (max 8.4%) average zero-shot performance improvement. After adaptation, the model still retains the generalization property of ALMs.
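The summary names the ingredients (augmented views, a learned domain vector, an entropy-based consistency loss) but not the mechanics. The toy sketch below illustrates one way the Augment, Combine, Optimize loop could fit together; it is not the authors' implementation. Several pieces are assumptions made for illustration: the embeddings are random stand-ins for real ALM encoder outputs, augmentation is simulated by jittering an embedding rather than transforming the waveform, the domain vector is added to every class text embedding, the loss is the entropy of the view-averaged prediction, and gradients come from finite differences instead of backpropagation. All function names (`augment_views`, `marginal_entropy`, `adapt`) are hypothetical.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def augment_views(audio_emb, k, rng, noise=0.1):
    # Augment (stand-in): jitter the embedding k times.
    # Assumption: real augmentations would act on the audio itself.
    return [normalize([x + rng.gauss(0.0, noise) for x in audio_emb])
            for _ in range(k)]


def marginal_entropy(domain_vec, views, text_embs, scale=10.0):
    # Combine (assumed form): shift every class text embedding by the
    # shared domain vector, then re-normalize.
    shifted = [normalize([t + d for t, d in zip(emb, domain_vec)])
               for emb in text_embs]
    mean_probs = [0.0] * len(shifted)
    for view in views:
        logits = [scale * sum(a * t for a, t in zip(view, emb))
                  for emb in shifted]
        for c, p in enumerate(softmax(logits)):
            mean_probs[c] += p / len(views)
    # Entropy of the prediction averaged over all views: low entropy means
    # the views agree on a confident prediction.
    return -sum(p * math.log(p) for p in mean_probs if p > 0.0)


def numerical_grad(f, x, h=1e-5):
    """Central finite-difference gradient (a stand-in for backpropagation)."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def adapt(views, text_embs, steps=30, lr=0.5):
    # Optimize: update only the domain vector; no labels are used.
    domain_vec = [0.0] * len(text_embs[0])

    def loss(v):
        return marginal_entropy(v, views, text_embs)

    start = current = loss(domain_vec)
    for _ in range(steps):
        grad = numerical_grad(loss, domain_vec)
        candidate = [x - lr * g for x, g in zip(domain_vec, grad)]
        cand_loss = loss(candidate)
        if cand_loss < current:   # accept only steps that lower the entropy
            domain_vec, current = candidate, cand_loss
        else:                     # otherwise shrink the step size
            lr *= 0.5
    return domain_vec, start, current


if __name__ == "__main__":
    rng = random.Random(0)
    # Random stand-ins for 4 class prompts and 1 test clip, 8 dimensions each.
    text_embs = [normalize([rng.gauss(0, 1) for _ in range(8)]) for _ in range(4)]
    audio_emb = normalize([rng.gauss(0, 1) for _ in range(8)])
    views = augment_views(audio_emb, k=6, rng=rng)
    _, before, after = adapt(views, text_embs)
    print(f"entropy before: {before:.4f}, after: {after:.4f}")
```

Because a step is accepted only when it lowers the loss, the returned entropy is never higher than the starting value. In a real setting the encoders would stay frozen and an autograd framework would replace the finite-difference gradient, which is only practical here because the toy domain vector has eight dimensions.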
