CLASP: Contrastive Language-Speech Pretraining for Multilingual Multimodal Information Retrieval

Mohammad Mahdi Abootorabi, Ehsaneddin Asgari

TL;DR

CLASP tackles multilingual audio-text information retrieval by learning a shared embedding space that directly maps speech to text without transcription. It fuses self-supervised speech representations with spectrogram features and aligns them to frozen multilingual text encoders (XLM-RoBERTa or LaBSE), trained on the Speech Brown dataset alongside Common Voice V4 and FLEURS. The results show state-of-the-art retrieval metrics across languages, with a smaller and faster model compared to ASR-based pipelines, enabled by a contrastive loss that improves semantic alignment. This work delivers a practical, scalable approach to cross-language retrieval over audio content and provides the Speech Brown dataset to support broader multimodal research.
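
To make the alignment objective concrete, here is a minimal sketch of a symmetric, CLIP-style contrastive loss over a batch of paired speech and text embeddings. The summary above only states that a contrastive loss is used; the symmetric InfoNCE form, the cosine similarity, the `temperature` value, and every function name below are assumptions for illustration, not the authors' implementation.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def cosine_similarity_matrix(speech_embs, text_embs):
    """Entry [i][j] is the cosine similarity between speech i and text j."""
    speech = [l2_normalize(v) for v in speech_embs]
    text = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(s, t)) for t in text] for s in speech]

def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(speech_embs, text_embs, temperature=0.07):
    """Average of the speech-to-text and text-to-speech losses.

    The temperature is a hypothetical default, not a value from the paper.
    """
    sim = cosine_similarity_matrix(speech_embs, text_embs)
    logits = [[value / temperature for value in row] for row in sim]
    logits_transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(logits_transposed))

# Toy batch: each speech embedding matches the text embedding at the same index.
speech_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
shuffled_text = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = symmetric_contrastive_loss(speech_batch, aligned_text)
shuffled_loss = symmetric_contrastive_loss(speech_batch, shuffled_text)
```

In this toy batch the aligned pairing yields a loss close to zero, while the shuffled pairing is penalized heavily. Because the text encoder is frozen, minimizing such a loss would in effect move only the speech-side embeddings toward their paired text embeddings.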

Abstract

This study introduces CLASP (Contrastive Language-Speech Pretraining), a multilingual, multimodal representation tailored for audio-text information retrieval. CLASP leverages the synergy between spoken content and textual data. During training, we utilize our newly introduced speech-text dataset, which encompasses 15 diverse categories ranging from fiction to religion. CLASP's audio component integrates audio spectrograms with a pre-trained self-supervised speech model, while its language encoding counterpart employs a sentence encoder pre-trained on over 100 languages. This unified lightweight model bridges the gap between various modalities and languages, enhancing its effectiveness in handling and retrieving multilingual and multimodal data. Our evaluations across multiple languages demonstrate that CLASP establishes new benchmarks in HITS@1, MRR, and meanR metrics, outperforming traditional ASR-based retrieval methods that rely on transcribing speech into text for subsequent text retrieval, especially in specific scenarios.
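
The three retrieval metrics named in the abstract can be computed from a query-by-candidate similarity matrix. The sketch below assumes that the correct candidate for query `i` sits at index `i` and that ties are resolved in the query's favor; the function name and the toy scores are hypothetical and are not results from the paper.

```python
def retrieval_metrics(similarity):
    """Compute HITS@1, MRR, and mean rank from a square similarity matrix.

    similarity[i][j] is the score between query i and candidate j; the
    correct candidate for query i is assumed to be candidate i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        correct_score = row[i]
        # Rank 1 means no candidate scored strictly higher than the correct one.
        rank = 1 + sum(1 for score in row if score > correct_score)
        ranks.append(rank)
    count = len(ranks)
    return {
        "hits@1": sum(1 for r in ranks if r == 1) / count,
        "mrr": sum(1.0 / r for r in ranks) / count,
        "mean_rank": sum(ranks) / count,
    }

# Toy scores: queries 0 and 2 rank their match first, query 1 ranks it second.
toy_similarity = [
    [0.9, 0.1, 0.2],
    [0.8, 0.5, 0.1],
    [0.3, 0.2, 0.7],
]
metrics = retrieval_metrics(toy_similarity)
```

Higher is better for HITS@1 and MRR, while lower is better for mean rank, so the three numbers should be read together rather than in isolation.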

Paper Structure

This paper contains 22 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview of the pipelines and model architecture. (a): The training pipeline and architecture for CLASP, utilizing batches of speech inputs paired with their corresponding text data. (b): The process of generating the final dataset for model training and evaluation. (c): The inference pipeline for retrieving the exact or nearest audio from the test dataset that matches a given text query in any language.
  • Figure 2: Overview of the two proposed strategies for the fusion encoder network architecture. (a) Gating Mechanism: A gating mechanism determines the contribution of self-supervised speech and spectrogram embeddings after initial transformations. (b) Concatenation Strategy: After initial transformations, the resulting self-supervised speech and spectrogram embeddings are concatenated and passed through a neural network encoder comprising linear, batch normalization, dropout, and activation layers to generate the final encoding.
  • Figure 3: 2-D illustration of sentence-level embeddings from different modalities, showing effective projection in the shared representation space for the test dataset. On the left, the self-supervised speech embeddings (blue) and text embeddings (red) are depicted. On the right, the CLASP output embeddings for speech (green) and text embeddings (red) are presented. These plots were generated using the t-SNE (van der Maaten and Hinton, 2008) dimensionality reduction technique.
  • Figure 4: Sentence Distribution by Category for Speech Brown Dataset. This horizontal bar chart illustrates the number of sentences across 15 different textual categories in the Speech Brown Dataset.
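
The gating strategy described in the caption of Figure 2(a) can be sketched as an element-wise convex combination of the two embeddings. The caption only says that a gate determines each embedding's contribution; the sigmoid gate computed from the concatenated inputs, the parameter shapes, and all names below are assumptions for illustration, and the initial transformations mentioned in the caption are taken as already applied.

```python
import math

def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def linear(weights, bias, inputs):
    """Dense layer: one output per weight row, plus the matching bias term."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def gated_fusion(speech_emb, spectrogram_emb, gate_weights, gate_bias):
    """Blend two equal-length embeddings with a learned per-dimension gate.

    A gate value near 1 favors the self-supervised speech embedding and a
    value near 0 favors the spectrogram embedding (hypothetical formulation).
    """
    gate_input = speech_emb + spectrogram_emb  # list concatenation
    gate = [sigmoid(z) for z in linear(gate_weights, gate_bias, gate_input)]
    return [g * s + (1.0 - g) * p
            for g, s, p in zip(gate, speech_emb, spectrogram_emb)]

# Toy 2-dimensional embeddings; zero gate parameters give a gate of 0.5.
speech_embedding = [1.0, 2.0]
spectrogram_embedding = [3.0, 4.0]
zero_weights = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
balanced = gated_fusion(speech_embedding, spectrogram_embedding,
                        zero_weights, [0.0, 0.0])
speech_dominant = gated_fusion(speech_embedding, spectrogram_embedding,
                               zero_weights, [50.0, 50.0])
```

With all gate parameters at zero the output is the plain average of the two embeddings, and a large positive bias pushes the output toward the speech embedding. The concatenation strategy of Figure 2(b) would instead feed the concatenated vector through an encoder network, with no explicit per-dimension gate.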