Enhancing Audio-Language Models through Self-Supervised Post-Training with Text-Audio Pairs

Anshuman Sinha; Camille Migozzi; Aubin Rey; Chao Zhang

TL;DR

This work tackles the gap in audio-language models' temporal and compositional understanding. It introduces TeminAL, a two-stage post-training approach that first teaches models to distinguish single from multiple sounds and then to reason about temporal relationships using temporally inverted and overlapped data, guided by a Temporal Noise Contrastive Estimation loss. The method yields an average gain of 5.28% in temporal understanding on ESC-50 while remaining competitive in zero-shot retrieval and classification on established benchmarks. The work also proposes ZSTE, a general-purpose framework for zero-shot temporal evaluation. Overall, TeminAL advances temporally aware ALMs under a constrained compute budget and offers a principled evaluation path for zero-shot temporal capabilities, with practical implications for downstream tasks that require timing-aware reasoning.

Abstract

Research on multi-modal contrastive learning strategies for audio and text has rapidly gained interest. Contrastively trained Audio-Language Models (ALMs) such as CLAP establish a unified representation across the audio and language modalities and have improved performance on various downstream tasks by providing well-aligned audio and text encoders. These improvements are evident in areas such as zero-shot audio classification and audio retrieval. However, the ability of these models to understand natural language and temporal relations remains largely unexplored. In this paper, we propose TeminAL, a temporal instillation method that equips multi-modal ALMs with temporal understanding without losing their prior capabilities on audio-language tasks. We implement a two-stage training scheme, TeminAL A $\&$ B: in TeminAL A the model first learns to differentiate between multiple sounds, and in TeminAL B a subsequent phase instills a sense of time, thereby enhancing its temporal understanding. This approach results in an average performance gain of $5.28\%$ in temporal understanding on the ESC-50 dataset, while the model remains competitive in zero-shot retrieval and classification tasks on the AudioCaps/Clotho datasets. We also note the lack of proper evaluation techniques for contrastive ALMs and propose ZSTE, a general-purpose strategy for evaluating ALMs in zero-shot settings, which we use to evaluate various prior models. ZSTE is applicable to zero-shot (ZS) contrastive models in general. The model trained with TeminAL outperforms current models on most downstream tasks.
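As a rough illustration of the temporally ordered, inverted, and overlapped training pairs described above, the following minimal sketch builds three audio-caption variants from two labeled clips. The function name `make_temporal_variants`, the caption templates, the equal-length cropping, and the simple averaging used for the overlay are all assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np


def make_temporal_variants(clip_a, clip_b, label_a, label_b):
    """Build forward, reversed, and overlaid (audio, caption) pairs.

    Hypothetical helper: caption wording and mixing weights are assumptions.
    """
    # Crop both clips to a common length so the overlay is well defined.
    n = min(len(clip_a), len(clip_b))
    a = np.asarray(clip_a[:n], dtype=np.float32)
    b = np.asarray(clip_b[:n], dtype=np.float32)

    forward = np.concatenate([a, b])   # sound A, then sound B
    reversed_ = np.concatenate([b, a])  # order inverted: B, then A
    overlaid = 0.5 * (a + b)            # both sounds at the same time

    return {
        "forward": (forward, f"{label_a} followed by {label_b}"),
        "reversed": (reversed_, f"{label_b} followed by {label_a}"),
        "overlaid": (overlaid, f"{label_a} and {label_b} at the same time"),
    }


if __name__ == "__main__":
    # Toy signals standing in for two one-second clips at 16 kHz.
    dog = np.ones(16000, dtype=np.float32)
    rain = np.zeros(16000, dtype=np.float32)
    for name, (audio, caption) in make_temporal_variants(dog, rain, "dog bark", "rain").items():
        print(name, audio.shape, caption)
```

In a contrastive setup, each variant's audio would be paired with its own caption as the positive, and the captions of the other two variants would act as hard negatives that differ only in temporal structure.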

Paper Structure (29 sections, 26 equations, 15 figures, 7 tables, 4 algorithms)

Introduction
Background and Related Work
Foundation models and Multi-modal text-audio learning
Self-Supervised Learning and Post-Training
Zero-shot Inference: Limitations of Classical Zero-Shot Retrieval
Methodology
Preliminaries
Data-processing: Designing our training data.
Preliminaries of post-training with SSL
Objective function for TeminAL: What addition we propose on classical contrastive learning
Details on hyper-parameters of the Loss formulation
Experiments
Base model
ZSTE and Downstream Tasks
Results
...and 14 more sections

Figures (15)

Figure 1: Overview of TeminAL, in which we post-train the original CLAP encoders $f_c$ and $f_a$ with our TeminAL method to obtain $f^t_c$ and $f^t_a$ after the two-stage training. We train only a subset of the total weights ($f^t_{c_{\theta}}$ and $f^t_{a_{\phi}}$) in both training stages. The mathematical formulation of the functions is elaborated in \ref{prem_ssl} and \ref{TeminAL}.
Figure 2: Temporal Augmentations
Figure 3: Overview of TeminAL B, in which we post-train the original CLAP encoders $f_c$ and $f_a$ with our TeminAL method to obtain $f^t_c$ and $f^t_a$. The functions are described in \ref{prem_ssl}, while the objective formulation for training ($f_c$, $f_a$) to achieve ($f^t_c$, $f^t_a$) is described in \ref{TeminAL}. The "Temporal contrastive loss" for TeminAL B is elaborated in \ref{fig6}.
Figure 4: The schematic showing Temporal Contrastive Loss for TeminAL B. On the vertical axis we have the audio embeddings with batches of data corresponding to $B_{a_B} = \{B_{{a}_f}, B_{{a}_r}, B_{{a}_o}\}$ and text embedding batches of data corresponding to $B_{c_B} = \{B_{{c}_f}, B_{{c}_r}, B_{{c}_o}\}$ on the horizontal axis.
Figure 5: Schematic explanation of the terms in the loss function for TeminAL B. Here we show one term (row) in the summation of $L_{t_B}$, which is $\text{TNCE}_{t}({\bm{z}}_a, {\bm{z}}_t)$. The other two terms, $\text{TNCE}_{t}({\bm{z}}_a, {\bm{z}}_t)$ and $\text{TNCE}_{t}({\bm{z}}_a, {\bm{z}}_t)$, of this loss function can be calculated in a similar way and belong to the green and pink blocks of the above schematic. Here, $B_{t_f}$, $B_{t_r}$ and $B_{t_o}$ are the batches of texts corresponding to time-consistent, reversed and overlaid samples, which compose the whole batch of text following the same convention as shown in \ref{TeminAL}.
...and 10 more figures
