Multiple Consistency-guided Test-Time Adaptation for Contrastive Audio-Language Models with Unlabeled Audio

Gongyu Chen, Haomin Zhang, Chaofan Ding, Zihao Chen, Xinhan Di

TL;DR

This work tackles the challenge of improving zero-shot performance for pre-trained audio-language models (ALMs) without labeled data by introducing an unsupervised, end-to-end test-time adaptation framework. The method combines conditional context- and domain-aware prompts with a multi-view augmentation strategy and a dual loss objective, $\mathcal{L}_{final} = \mathcal{L}_{consistency} + \lambda_{contrastive} \mathcal{L}_{contrastive}$, to guide adaptation. Evaluations across 12 diverse downstream tasks show an average zero-shot accuracy improvement of $4.41\%$ (up to $7.50\%$), outperforming state-of-the-art unsupervised DA baselines and showing robust cross-domain generalization. The approach reduces dependence on labeled data and makes ALMs more practical to deploy under real-world distribution shifts.

Abstract

Pre-trained Audio-Language Models (ALMs) show impressive zero-shot generalization, and test-time adaptation (TTA) methods aim to improve their performance on a target domain without annotations. However, previous TTA methods for ALMs in zero-shot classification tend to get stuck on incorrect model predictions. To further boost performance, we propose multiple forms of guidance for prompt learning without annotated labels. First, we impose consistency guidance on both the context tokens and the domain tokens of ALMs. Second, we impose both consistency across multiple augmented views of each test sample and contrastive learning across different test samples. Third, we propose a corresponding end-to-end learning framework for the proposed test-time adaptation method without annotated labels. We extensively evaluate our approach on 12 downstream tasks across domains; the proposed adaptation method yields a 4.41% (max 7.50%) average zero-shot performance improvement over state-of-the-art models.
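The objective described above can be illustrated with a minimal sketch. It assumes the consistency term is the self-entropy of the class distribution averaged over the augmented views of one test sample, which matches the description in Figure 1. The form of the contrastive term is not given here, so it is passed in as a precomputed number. The function names and the default weight `lam=1.0` are hypothetical choices for illustration, not values from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_distribution(view_logits):
    """Average the per-view class distributions of one test sample.

    view_logits: list of logit lists, one per augmented view (assumed layout).
    """
    probs = [softmax(logits) for logits in view_logits]
    n_views = len(probs)
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / n_views for c in range(n_classes)]


def self_entropy(dist, eps=1e-12):
    """Shannon entropy of a distribution; lower means a more confident prediction."""
    return -sum(p * math.log(p + eps) for p in dist)


def final_loss(consistency, contrastive, lam=1.0):
    """L_final = L_consistency + lambda_contrastive * L_contrastive.

    `contrastive` is a placeholder value: its exact form is an assumption
    left outside this sketch.
    """
    return consistency + lam * contrastive


# Two augmented views of one sample over three classes (made-up logits).
views = [[2.0, 0.5, 0.1], [1.8, 0.7, 0.2]]
avg = average_distribution(views)
loss = final_loss(self_entropy(avg), contrastive=0.3, lam=0.5)
```

Minimizing the entropy of the averaged distribution pushes the views toward one shared, confident prediction, while the weighted contrastive term brings in information from other test samples.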

Paper Structure

This paper contains 16 sections, 11 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Our end-to-end test-time tuning framework. (1) Augmentation. Four augmentations are performed on the raw audio; Time Reorder cuts the spectrum in half and then swaps the two halves. (2) Combination. Multiple conditional consistency networks receive the audio embedding and generate learnable tokens, which are combined with the original prompt in two ways. (3) Optimization. The minimum self-entropy loss and the contrastive learning loss are calculated using the average distribution.
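The Time Reorder augmentation in step (1) can be sketched as follows. This is a rough illustration that assumes the cut is made at the midpoint of the time axis and that the spectrum is stored as a list of frequency-bin rows, each holding one value per time frame. The function name `time_reorder` and this data layout are hypothetical.

```python
def time_reorder(spec):
    """Swap the first and second halves of a spectrogram along the time axis.

    spec: list of rows (frequency bins), each a list of time-frame values.
    Assumption: the split point is the midpoint of the time axis; with an odd
    number of frames, the extra frame stays with the second half.
    """
    if not spec:
        return []
    mid = len(spec[0]) // 2
    # Move the second half in front of the first half in every frequency row.
    return [row[mid:] + row[:mid] for row in spec]


# Two frequency bins, four time frames.
spec = [[1, 2, 3, 4],
        [5, 6, 7, 8]]
reordered = time_reorder(spec)  # [[3, 4, 1, 2], [7, 8, 5, 6]]
```

With an even number of frames, applying the function twice restores the original spectrum, which gives a quick sanity check.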