
Cross-lingual Matryoshka Representation Learning across Speech and Text

Yaya Sy, Dioula Doucouré, Christophe Cerisara, Irina Illina

TL;DR

This work addresses the dual barriers of language and modality by introducing cross-lingual speech-text Matryoshka representations to retrieve French documents from Wolof speech without ASR-translation. It proposes data pipelines and benchmarks for Wolof-French, and compares text-only, Late-Fusion, and Dual architectures, finding that Late-Fusion with a frozen text Matryoshka model offers the best balance of expressivity, generalization, and efficiency. The results show strong cross-modal retrieval performance, with transfer to unseen tasks like speech intent detection, and reveal that information largely concentrates in a subset of Matryoshka dimensions, suggesting avenues for more efficient deployment. The study highlights practical implications for improving information access in under-represented languages and points to future work on generalized language pairs, dynamic sparsity, and broader multitask capabilities.

Abstract

Speakers of under-represented languages face both a language barrier, as most online knowledge is in a few dominant languages, and a modality barrier, since information is largely text-based while many languages are primarily oral. We address this for French-Wolof by training the first bilingual speech-text Matryoshka embedding model, enabling efficient retrieval of French text from Wolof speech queries without relying on costly ASR-translation pipelines. We introduce large-scale data curation pipelines and new benchmarks, compare modeling strategies, and show that modality fusion within a frozen text Matryoshka model performs best. Although trained only for retrieval, the model generalizes well to other tasks, such as speech intent detection, which indicates that it learns general semantic representations. Finally, we analyze cost-accuracy trade-offs across Matryoshka dimensions and ranks, showing that information is concentrated in only a few components, which suggests potential for efficiency improvements.
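
As an illustration of the retrieval setting described in the abstract, the following is a minimal sketch of how Matryoshka embeddings are typically used at query time: keep only the first `dim` coordinates, re-normalize, and rank documents by cosine similarity. It is not the authors' implementation. The function names `truncate_and_normalize` and `retrieve` are hypothetical, and random vectors stand in for the Wolof speech query and French document embeddings, which in practice would come from the trained models.

```python
import numpy as np

def truncate_and_normalize(emb, dim):
    """Keep the first `dim` coordinates and L2-normalize along the last axis."""
    cut = emb[..., :dim]
    norms = np.linalg.norm(cut, axis=-1, keepdims=True)
    return cut / np.clip(norms, 1e-12, None)

def retrieve(query_emb, doc_embs, dim, top_k=3):
    """Rank documents by cosine similarity using only the first `dim` dimensions."""
    q = truncate_and_normalize(query_emb, dim)
    d = truncate_and_normalize(doc_embs, dim)
    scores = d @ q                      # cosine similarity per document
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

# Stand-in data (assumption): 100 "document" embeddings of size 512,
# and a "query" that is a noisy copy of document 42.
rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(100, 512))
query_emb = doc_embs[42] + 0.1 * rng.normal(size=512)

for dim in (64, 128, 512):
    top, top_scores = retrieve(query_emb, doc_embs, dim)
    print(f"dim={dim}: best match = document {top[0]}")
```

Smaller values of `dim` reduce index size and scoring cost in proportion to the number of coordinates kept, which is the cost side of the cost-accuracy trade-off the abstract refers to.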

Paper Structure (26 sections, 1 equation, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Speech data pipeline. Raw data is filtered via Source Separation, Diarization, VAD, and Quality Filtering. The resulting speech is transcribed, bad transcriptions (in red) are filtered out, and the remaining transcriptions are used to generate French story, dialog, and blog-post documents.
  • Figure 2: Late-Fusion vs. Dual architectures. In the Late-Fusion approach, speech is encoded with HuBERT, then the sequence is downsampled by a CNN (x2), projected into the LLM embedding space with a matrix $\mathbf{W}$, and concatenated with the prompt token embeddings; the full sequence is then passed to the Matryoshka embedding LLM. In the Dual architecture, HuBERT features are pooled at the sequence level using an attention-based pooler, then projected with dimension-specific $\mathbf{W}$ matrices to obtain Matryoshka embeddings. In both architectures, text documents are embedded using the text-only Matryoshka embedding LLM.
  • Figure 3: Keyword Spotting (Urban Bus) Performance Comparison across different embedding dimensions.
  • Figure 4: Percentage of dimensions needed to represent a given energy ratio.
  • Figure 5: Percentage of dimensions needed to represent the full energy.
  • ...and 2 more figures
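
The captions of Figures 4 and 5 refer to the percentage of dimensions needed to represent a given energy ratio. This page does not define "energy", so the sketch below rests on an assumption: energy is taken to be the sum of squared singular values of an embedding matrix, and the question is how many leading components are needed to reach a target share of it. The function name `percent_components_for_energy` is hypothetical, and the paper's actual definition may differ.

```python
import numpy as np

def percent_components_for_energy(matrix, ratio):
    """Percentage of leading singular components whose squared singular
    values sum to at least `ratio` of the total (assumed energy definition)."""
    s = np.linalg.svd(matrix, compute_uv=False)   # sorted in descending order
    energy = s ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    # First index whose cumulative share reaches the target ratio.
    k = int(np.searchsorted(cumulative, ratio, side="left")) + 1
    k = min(k, len(s))                            # guard against rounding at ratio = 1.0
    return 100.0 * k / len(s)

# Toy matrix with singular values 3, 2, 1, 0, so energies are 9, 4, 1, 0.
toy = np.diag([3.0, 2.0, 1.0, 0.0])
for ratio in (0.6, 0.9):
    pct = percent_components_for_energy(toy, ratio)
    print(f"{ratio:.0%} of energy needs {pct:.0f}% of components")
```

In the toy case the first component alone carries 9/14 of the energy, so a 60% target needs one of four components, while a 90% target needs two.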