DSCLAP: Domain-Specific Contrastive Language-Audio Pre-Training

Shengqiang Liu, Da Liu, Anna Wang, Zhiyu Zhang, Jie Gao, Yali Li

TL;DR

DSCLAP tackles the lack of domain-aligned language-audio pre-training for intelligent voice assistants by using ASR-generated text from raw audio to perform cross-modal learning in the target domain. It jointly optimizes an InfoNCE-based contrastive objective and a language-audio matching objective, enhanced by hard negative sampling to improve efficiency. Pre-trained on 12,107 hours of in-vehicle data, DSCLAP achieves state-of-the-art results on two downstream tasks, MDSD and MCIC, and shows robustness to ASR noise while reducing data requirements. This approach lowers data collection costs and enables practical, domain-specialized multimodal pre-training for IVAs.
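The InfoNCE-based contrastive objective mentioned above can be illustrated with a minimal sketch. The summary does not give the exact formulation, so the symmetric (audio-to-text plus text-to-audio) form, the cosine similarity, the temperature of 0.07, and all function names below are assumptions chosen for illustration, not the authors' implementation.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(audio_embs, text_embs, temperature=0.07):
    """Entry [i][j] is the temperature-scaled cosine similarity of audio i and text j."""
    a = [l2_normalize(v) for v in audio_embs]
    t = [l2_normalize(v) for v in text_embs]
    return [[sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t] for ai in a]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def info_nce(audio_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE (assumed form): mean of audio-to-text and text-to-audio losses."""
    sim = similarity_matrix(audio_embs, text_embs, temperature)
    sim_t = [list(col) for col in zip(*sim)]  # transpose for the text-to-audio direction
    return 0.5 * (cross_entropy_diagonal(sim) + cross_entropy_diagonal(sim_t))

# Toy batch: three audio embeddings and stand-in embeddings of their ASR transcripts.
audio = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = info_nce(audio, audio)                            # every pair matches
shuffled = info_nce(audio, [audio[1], audio[2], audio[0]])  # every pair is mismatched
```

In this toy batch the loss is close to zero when each audio embedding lines up with its own transcript embedding and much larger when the transcripts are shuffled, which is the behavior a contrastive objective is meant to reward.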

Abstract

Analyzing real-world multimodal signals is an essential and challenging task for intelligent voice assistants (IVAs). Mainstream approaches have achieved remarkable performance on various downstream IVA tasks using pre-trained audio models and text models. However, these models are pre-trained independently, and usually on tasks that differ from those of the target domain, resulting in sub-optimal modality representations for downstream tasks. Moreover, in many domains, collecting enough language-audio pairs is extremely hard, and transcribing raw audio requires considerable professional expertise, making joint pre-training difficult or even infeasible. To address these pain points, we propose DSCLAP, a simple and effective framework that enables language-audio pre-training with only raw audio signals as input. Specifically, DSCLAP converts raw audio signals into text via an ASR system and combines a contrastive learning objective with a language-audio matching objective to align the audio with its ASR transcriptions. We pre-train DSCLAP on 12,107 hours of in-vehicle domain audio. Empirical results on two downstream tasks show that, while conceptually simple, DSCLAP significantly outperforms the baseline models on all metrics, showing great promise for domain-specific IVA applications.
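To make the language-audio matching objective more concrete, the sketch below builds the labeled pairs that a binary matched/mismatched classifier would be trained on, using hard negatives taken from a contrastive similarity matrix. The sampling rule is an assumption: this summary does not specify it, and the version here simply picks the single most similar non-matching item (a deterministic simplification of the similarity-weighted sampling used in some vision-language pre-training work). All function names are hypothetical.

```python
def hardest_negative_indices(sim):
    """For each row i, return the column j != i with the highest similarity."""
    picks = []
    for i, row in enumerate(sim):
        candidates = [(score, j) for j, score in enumerate(row) if j != i]
        picks.append(max(candidates)[1])
    return picks

def build_matching_pairs(sim):
    """Return (audio_idx, text_idx, label) triples: 1 = matched, 0 = mismatched."""
    # Positives: each audio clip with its own ASR transcript.
    pairs = [(i, i, 1) for i in range(len(sim))]
    # Hard negative transcript for each audio clip (rows of sim).
    pairs += [(i, j, 0) for i, j in enumerate(hardest_negative_indices(sim))]
    # Hard negative audio clip for each transcript (columns of sim).
    sim_t = [list(col) for col in zip(*sim)]
    pairs += [(j, i, 0) for i, j in enumerate(hardest_negative_indices(sim_t))]
    return pairs

# Toy audio-by-text similarity matrix for a batch of three pairs.
sim = [
    [0.9, 0.7, 0.1],
    [0.2, 0.8, 0.6],
    [0.5, 0.3, 0.9],
]
pairs = build_matching_pairs(sim)
```

Under these assumptions, each batch of N audio-transcript pairs yields N positives and 2N hard negatives. A matching head would then score each pair and be trained with binary cross-entropy, and that loss would be added to the contrastive loss.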

Paper Structure

This paper contains 15 sections, 5 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: An illustration of the DSCLAP framework. In contrast to prior work (Elizalde et al., 2022) that leverages pre-prepared language-audio pairs for contrastive pre-training (red dashed arrows), DSCLAP (black solid arrows) requires only raw audio inputs. Besides the standard InfoNCE loss, and inspired by Zeng et al. (2022), we introduce a Language-Audio Matching (LAM) objective to achieve more effective contrastive learning.
  • Figure 2: Accuracy curves for MDSD on the ASR-only dataset with respect to the size of the training data. Values are reported over five runs with different random seeds.