
T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining

Yi Yuan, Zhuo Chen, Xubo Liu, Haohe Liu, Xuenan Xu, Dongya Jia, Yuanzhe Chen, Mark D. Plumbley, Wenwu Wang

TL;DR

T-CLAP addresses the lack of temporal reasoning in contrastive language-audio pretraining by introducing temporal-enhanced training via two data-generation pipelines and a temporal-focused loss $L_t$, combined with the standard contrastive loss $L_c$ in the overall objective $L_{train}=L_c+\lambda_l L_t$, where $\lambda_l=0.5$. Using HTSAT and RoBERTa encoders, the model learns temporally ordered audio-text representations and is evaluated on retrieval, zero-shot classification, T-Classify, and text-to-audio generation with AudioLDM. Results indicate robust improvements over baselines, with state-of-the-art performance on temporal retrieval tasks and better alignment in generation tasks, albeit with some limitations on longer-context datasets and synthetic negatives. The approach enhances temporal reasoning in multimodal audio-language models, with practical implications for more accurate retrieval, classification, and generation in real-world, temporally structured audio content.
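The objective $L_{train}=L_c+\lambda_l L_t$ can be illustrated with a minimal pure-Python sketch. Only the weighted sum and $\lambda_l=0.5$ come from the summary above. The exact form of $L_t$ is not given here, so `temporal_loss` below is an assumption: a two-way softmax that asks each audio clip to score its correctly ordered caption above a temporally reordered one. The function names are hypothetical, and temperature scaling is omitted for brevity.

```python
import math

def log_softmax_row(row):
    """Numerically stable log-softmax of a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(sim):
    """Symmetric InfoNCE over a square audio-text similarity matrix.

    sim[i][j] is the similarity of audio i and caption j;
    matched pairs sit on the diagonal.
    """
    n = len(sim)
    audio_to_text = -sum(log_softmax_row(sim[i])[i] for i in range(n)) / n
    columns = [[sim[i][j] for i in range(n)] for j in range(n)]
    text_to_audio = -sum(log_softmax_row(columns[j])[j] for j in range(n)) / n
    return 0.5 * (audio_to_text + text_to_audio)

def temporal_loss(pos_sim, neg_sim):
    """ASSUMED form of L_t (not taken from the paper).

    pos_sim[i]: similarity of audio i with its correctly ordered caption.
    neg_sim[i]: similarity of audio i with the temporally reordered caption.
    """
    losses = [-log_softmax_row([p, q])[0] for p, q in zip(pos_sim, neg_sim)]
    return sum(losses) / len(losses)

def train_loss(sim, pos_sim, neg_sim, lam=0.5):
    """L_train = L_c + lam * L_t, with lam = 0.5 as stated in the summary."""
    return contrastive_loss(sim) + lam * temporal_loss(pos_sim, neg_sim)
```

With all similarities equal, both terms reduce to $\log 2$ for a batch of two, so `train_loss` returns $1.5\log 2$; raising `pos_sim` above `neg_sim` lowers the temporal term, which is the behaviour the weighting is meant to reward.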

Abstract

Contrastive language-audio pretraining (CLAP) has been developed to align the representations of audio and language, achieving remarkable performance in retrieval and classification tasks. However, current CLAP struggles to capture temporal information within audio and text features, presenting substantial limitations for tasks such as audio retrieval and generation. To address this gap, we introduce T-CLAP, a temporal-enhanced CLAP model. We use Large Language Models (LLMs) and mixed-up strategies to generate temporal-contrastive captions for audio clips from extensive audio-text datasets. Subsequently, a new temporal-focused contrastive loss is designed to fine-tune the CLAP model by incorporating these synthetic data. We conduct comprehensive experiments and analysis in multiple downstream tasks. T-CLAP shows improved capability in capturing the temporal relationship of sound events and outperforms state-of-the-art models by a significant margin.
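To make the idea of a temporal-contrastive caption concrete, here is a small sketch of how a mixed-up pair might be built. This is an assumption for illustration, not the authors' pipeline: two clips are concatenated in time, the positive caption names the events in the order they occur, and the negative caption swaps that order. The function name `mix_up`, the connector phrase, and the representation of audio as plain lists of samples are all hypothetical.

```python
def mix_up(clip_a, clip_b, connector="followed by"):
    """Build one synthetic training example from two (samples, caption) clips.

    ASSUMPTION: audio is a plain list of samples and clips are joined
    back to back; a real pipeline would operate on waveforms or tensors.
    """
    audio_a, caption_a = clip_a
    audio_b, caption_b = clip_b
    return {
        # Clip A plays first, then clip B.
        "audio": list(audio_a) + list(audio_b),
        # Caption that matches the true temporal order.
        "positive": f"{caption_a} {connector} {caption_b}",
        # Same events, reversed order: the temporal negative.
        "negative": f"{caption_b} {connector} {caption_a}",
    }

example = mix_up(([0.1, 0.2], "a dog barks"), ([0.3], "a door slams"))
```

The positive and negative captions contain exactly the same words for the sound events, so a model can only tell them apart by attending to their order.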

Paper Structure

This paper contains 12 sections, 3 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Pipelines for generating the negative captions.
  • Figure 2: Pipeline for training T-CLAP, with original contrastive loss $L_{c}$ on the left and proposed temporal-focused loss $L_{t}$ on the right.