Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond

Beomseok Lee, Ioan Calapodescu, Marco Gaido, Matteo Negri, Laurent Besacier

TL;DR

Speech-MASSIVE addresses the scarcity of massively multilingual SLU resources by creating a spoken version of a portion of the MASSIVE corpus in 12 languages and evaluating SLU with cascaded and end-to-end architectures under zero-shot, few-shot, and full fine-tuning regimes. The work covers data collection, validation, and ASR benchmarking with Whisper, along with baselines for NLU, LID, ST, and multi-task SLU. Key contributions include a crowdsourced data collection pipeline with quality controls, comprehensive dataset statistics, and cross-task baselines that allow speech foundation models to be evaluated on multilingual SLU tasks. The dataset and baselines support cross-language benchmarking of SLU, ASR, and related tasks, with practical relevance for multilingual voice assistants and cross-language NLP systems.

Abstract

We present Speech-MASSIVE, a multilingual Spoken Language Understanding (SLU) dataset comprising the speech counterpart for a portion of the MASSIVE textual corpus. Speech-MASSIVE covers 12 languages from different families and inherits from MASSIVE the annotations for the intent prediction and slot-filling tasks. Our extension is prompted by the scarcity of massively multilingual SLU datasets and the growing need for versatile speech datasets to assess foundation models (LLMs, speech encoders) across languages and tasks. We provide a multimodal, multitask, multilingual dataset and report SLU baselines using both cascaded and end-to-end architectures in various training scenarios (zero-shot, few-shot, and full fine-tune). Furthermore, we demonstrate the suitability of Speech-MASSIVE for benchmarking other tasks such as speech transcription, language identification, and speech translation. The dataset, models, and code are publicly available at: https://github.com/hlt-mt/Speech-MASSIVE
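
The cascaded setup named in the abstract (speech transcribed first, then passed to a text NLU model) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the utterance ids, the toy transcripts, the keyword-based `toy_nlu`, and the dictionary-backed `toy_asr` are hypothetical stand-ins, not the paper's models or data. The intent labels are only example strings. The sketch shows how a single transcription error can lower intent accuracy in the cascaded pipeline relative to NLU on the reference text.

```python
from typing import Callable, Sequence


def cascaded_slu(audio_id: str,
                 asr: Callable[[str], str],
                 nlu: Callable[[str], str]) -> str:
    """Cascaded SLU: transcribe the audio, then classify the transcript."""
    transcript = asr(audio_id)
    return nlu(transcript)


def intent_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of utterances whose predicted intent matches the gold intent."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Hypothetical toy data: reference text, simulated ASR output, gold intents.
GOLD_TEXT = {
    "utt1": "wake me up at nine",
    "utt2": "play some jazz",
    "utt3": "what is the weather today",
}
ASR_OUTPUT = {
    "utt1": "wake me up at nine",
    "utt2": "play some jazz",
    "utt3": "what is the whether today",  # simulated transcription error
}
GOLD_INTENT = {
    "utt1": "alarm_set",
    "utt2": "play_music",
    "utt3": "weather_query",
}


def toy_asr(audio_id: str) -> str:
    """Stand-in for a speech recognizer: looks up a canned transcript."""
    return ASR_OUTPUT[audio_id]


def toy_nlu(text: str) -> str:
    """Stand-in for an intent classifier: simple keyword rules."""
    words = text.split()
    if "wake" in words:
        return "alarm_set"
    if "play" in words:
        return "play_music"
    if "weather" in words:
        return "weather_query"
    return "general_quirky"


utt_ids = sorted(GOLD_TEXT)
gold = [GOLD_INTENT[u] for u in utt_ids]

# NLU on reference text versus the cascaded pipeline on (simulated) speech.
nlu_preds = [toy_nlu(GOLD_TEXT[u]) for u in utt_ids]
cascaded_preds = [cascaded_slu(u, toy_asr, toy_nlu) for u in utt_ids]

nlu_acc = intent_accuracy(nlu_preds, gold)
cascaded_acc = intent_accuracy(cascaded_preds, gold)
print(f"NLU intent accuracy:      {nlu_acc:.3f}")
print(f"Cascaded intent accuracy: {cascaded_acc:.3f}")
```

An end-to-end system would replace the two-step `cascaded_slu` call with a single model mapping audio directly to the intent, which removes the intermediate transcript where this kind of error propagation occurs.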

Paper Structure

This paper contains 13 sections, 3 figures, and 8 tables.

Figures (3)

  • Figure 1: NLU vs Cascaded SLU (Intent Accuracy) on our Speech-MASSIVE Dataset.
  • Figure 2: Input/Output formatting across NLU/SLU tasks. En: original English text. Fr: French translation in MASSIVE. Annot, Slots and Intent: slot and intent annotation of MASSIVE.
  • Figure 3: Various task control tokens fed to Whisper's decoder.