Enhancing Audio-Language Models through Self-Supervised Post-Training with Text-Audio Pairs
Anshuman Sinha, Camille Migozzi, Aubin Rey, Chao Zhang
TL;DR
This work addresses the limited temporal and compositional understanding of audio-language models (ALMs). It introduces TeminAL, a two-stage post-training approach that first teaches a model to distinguish single from multiple sounds and then to reason about temporal relationships using temporally inverted and overlapped data, guided by a Temporal Noise Contrastive Estimation loss. The method yields an average gain of 5.28% in temporal understanding on ESC-50 while remaining competitive in zero-shot retrieval and classification on established benchmarks. The work also proposes ZSTE, a general-purpose framework for zero-shot temporal evaluation. Overall, TeminAL improves the temporal awareness of ALMs under a constrained compute budget and offers a principled way to evaluate zero-shot temporal capabilities, with practical implications for downstream tasks that require timing-aware reasoning.
Abstract
Research on multi-modal contrastive learning strategies for audio and text has rapidly gained interest. Contrastively trained Audio-Language Models (ALMs), such as CLAP, establish a unified representation across the audio and language modalities and have improved performance on various downstream tasks by providing good text-aligned audio encoders and vice versa. These improvements are evident in areas such as zero-shot audio classification and audio retrieval. However, the ability of these models to understand natural language and temporal relations remains largely unexplored. In this paper, we propose TeminAL, a temporal instillation method that equips multi-modal ALMs with temporal understanding without losing their prior capabilities on audio-language tasks. We implement a two-stage training scheme, TeminAL A $\&$ B: in TeminAL A the model first learns to differentiate between multiple sounds, and in TeminAL B it is instilled with a sense of time, thereby enhancing its temporal understanding. This approach results in an average performance gain of $5.28\%$ in temporal understanding on the ESC-50 dataset, while the model remains competitive in zero-shot retrieval and classification tasks on the AudioCaps and Clotho datasets. We also note the lack of proper evaluation techniques for contrastive ALMs and propose ZSTE, a general-purpose strategy for evaluating ALMs in zero-shot settings, which we use to evaluate various prior models. ZSTE provides a general strategy for evaluating any zero-shot (ZS) contrastive model. The model trained with TeminAL outperforms current models on most downstream tasks.
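
As a rough illustration of the temporally inverted and overlapped data mentioned above, the following minimal sketch builds one sequential example and its inverted caption from two labeled clips. It is not the authors' pipeline: the function names, the "before" caption template, the equal-weight mixing, and the representation of waveforms as plain lists of floats are all assumptions made for illustration.

```python
from typing import List, Tuple

Waveform = List[float]  # assumption: a mono clip as a plain list of samples


def concatenate(first: Waveform, second: Waveform) -> Waveform:
    """Sequential composition: `first` plays, then `second`."""
    return first + second


def overlap(first: Waveform, second: Waveform) -> Waveform:
    """Mix two clips sample by sample, zero-padding the shorter one.

    The equal 0.5 weights are an illustrative choice, not taken from the paper.
    """
    n = max(len(first), len(second))
    a = first + [0.0] * (n - len(first))
    b = second + [0.0] * (n - len(second))
    return [0.5 * (x + y) for x, y in zip(a, b)]


def make_temporal_example(
    clip_a: Waveform, label_a: str, clip_b: Waveform, label_b: str
) -> Tuple[Waveform, str, str]:
    """Return (audio, matching caption, temporally inverted caption).

    The audio plays A then B, so the inverted caption describes the wrong
    order and can serve as a hard negative in a contrastive objective.
    The "X before Y" template is hypothetical.
    """
    audio = concatenate(clip_a, clip_b)
    positive = f"{label_a} before {label_b}"
    inverted = f"{label_b} before {label_a}"
    return audio, positive, inverted


if __name__ == "__main__":
    dog = [0.1, 0.2]
    rain = [0.3, 0.4, 0.5]
    audio, pos, neg = make_temporal_example(dog, "dog barking", rain, "rain")
    print(len(audio), "|", pos, "|", neg)
```

In such a setup, a contrastive loss would pull the audio embedding toward the matching caption and push it away from the inverted one; the overlapped mix would be paired with a caption describing simultaneous sounds in the same way.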
