GLAP: General contrastive audio-text pretraining across domains and languages

Heinrich Dinkel, Zhiyong Yan, Tianzi Wang, Yongqing Wang, Xingwei Sun, Yadong Niu, Jizhong Liu, Gang Li, Junbo Zhang, Jian Luan

TL;DR

GLAP addresses the gap in multilingual speech-text understanding with a unified general language audio pretraining framework that aligns speech content with text while preserving sound and music capabilities. It combines a pre-trained multilingual text encoder and a general audio encoder with trainable projections, trained with a sigmoid-based contrastive loss that stays efficient at large batch sizes. The model is trained on a diverse, multilingual mix of datasets and evaluated on sound, music, speech, and zero-shot tasks, showing competitive results on standard benchmarks and strong multilingual performance, including keyword spotting across 50 languages. Notably, GLAP achieves high speech and music retrieval performance in English and Chinese, plus robust zero-shot and multilingual capabilities on the MSW corpus, underscoring its cross-domain and cross-language applicability. The work releases public checkpoints and demonstrates that a single model can unify general audio-text embeddings across domains and beyond English.
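To make the sigmoid-based contrastive objective concrete, the following is a minimal NumPy sketch of a pairwise sigmoid loss in the style of SigLIP. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `temperature` and `bias` values, and the normalization by batch size are all assumed here and are not taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def sigmoid_contrastive_loss(audio_emb, text_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of projected audio/text embeddings.

    audio_emb, text_emb: arrays of shape (batch, dim); row i of each is a
    matching pair. temperature and bias are illustrative constants here
    (assumption); in practice they are typically learnable scalars.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = temperature * (a @ t.T) + bias        # (batch, batch) similarities
    n = logits.shape[0]
    labels = 2.0 * np.eye(n) - 1.0                 # +1 on the diagonal, -1 elsewhere
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed stably with logaddexp.
    per_pair = np.logaddexp(0.0, -labels * logits)
    return per_pair.sum() / n                      # normalize by batch size


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(8, 16))
    text = audio + 0.05 * rng.normal(size=(8, 16))  # nearly aligned pairs
    print("aligned loss: ", sigmoid_contrastive_loss(audio, text))
    print("shuffled loss:", sigmoid_contrastive_loss(audio, text[::-1]))
```

Each audio-text pair is scored as an independent binary decision, so no softmax normalization over the whole batch is required. This independence is the property usually credited with making sigmoid losses convenient at large batch sizes.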

Abstract

Contrastive Language Audio Pretraining (CLAP) is a widely used method to bridge the gap between audio and text domains. Current CLAP methods enable sound and music retrieval in English, ignoring multilingual spoken content. To address this, we introduce general language audio pretraining (GLAP), which expands CLAP with multilingual and multi-domain abilities. GLAP demonstrates its versatility by achieving competitive performance on standard audio-text retrieval benchmarks like Clotho and AudioCaps, while significantly surpassing existing methods in speech retrieval and classification tasks. Additionally, GLAP achieves strong results on widely used sound-event zero-shot benchmarks, while simultaneously outperforming previous methods on speech content benchmarks. Further keyword spotting evaluations across 50 languages emphasize GLAP's advanced multilingual capabilities. Finally, multilingual sound and music understanding is evaluated across four languages. Checkpoints and source code: https://github.com/xiaomi-research/dasheng-glap.

Paper Structure

This paper contains 12 sections, 3 equations, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: GLAP’s retrieval and zero-shot performance. A@T and T@A denote the Audio-to-Text and Text-to-Audio retrieval tasks, respectively; all others are zero-shot (#labels).
  • Figure 2: GLAP enables multilingual speech-content retrieval on top of the standard sound/music capabilities.
  • Figure 3: Multilingual zero-shot keyword spotting performance across 50 languages. The number of keywords (num) for each language is shown on the right.