SpeechTaxi: On Multilingual Semantic Speech Classification

Lennart Keller, Goran Glavaš

TL;DR

The paper addresses semantic speech classification in multilingual settings by contrasting end-to-end (E2E) approaches, which fine-tune multilingual speech encoders, with cascade approaches (CA), which first transcribe speech and then classify the transcript with a text-based model. It introduces SpeechTaxi, an 80-hour multilingual dataset spanning 28 languages with six semantic classes, derived from Taxi1500 and Bible audio, and evaluates E2E and CA under In-Language, All-Languages, and Zero-Shot Cross-Lingual Transfer scenarios. The key findings are that E2E outperforms CA in monolingual setups but struggles with cross-lingual transfer, whereas CA benefits from multilingual training. Transcription to Romanized text offers a robust path for languages without native ASR support. The work highlights that a language-agnostic CA based on Romanized text can be particularly effective for low-resource, spoken-only languages, and it provides publicly available data to support further multilingual SLU research.

Abstract

Recent advancements in multilingual speech encoding as well as transcription raise the question of the most effective approach to semantic speech classification. Concretely, can (1) end-to-end (E2E) classifiers obtained by fine-tuning state-of-the-art multilingual speech encoders (MSEs) match or surpass the performance of (2) cascading (CA), where speech is first transcribed into text and classification is delegated to a text-based classifier? To answer this, we first construct SpeechTaxi, an 80-hour multilingual dataset for semantic speech classification of Bible verses, covering 28 diverse languages. We then leverage SpeechTaxi to conduct a wide range of experiments comparing E2E and CA in monolingual semantic speech classification as well as in cross-lingual transfer. We find that E2E based on MSEs outperforms CA in monolingual setups, i.e., when trained on in-language data. However, MSEs seem to have poor cross-lingual transfer abilities, with E2E substantially lagging CA both in (1) zero-shot transfer to languages unseen in training and (2) multilingual training, i.e., joint training on multiple languages. Finally, we devise a novel CA approach based on transcription to Romanized text as a language-agnostic intermediate representation and show that it represents a robust solution for languages without native ASR support. Our SpeechTaxi dataset is publicly available at: https://huggingface.co/datasets/LennartKeller/SpeechTaxi/.

Paper Structure (11 sections, 2 figures, 5 tables)

Figures (2)

  • Figure 1: An instance of SpeechTaxi.
  • Figure 2: Illustration of SpeechTaxi creation from Taxi1500 and OpenBible data.