MSNER: A Multilingual Speech Dataset for Named Entity Recognition

Quentin Meeus, Marie-Francine Moens, Hugo Van hamme

TL;DR

MSNER addresses the lack of multilingual Spoken NER resources by releasing a VoxPopuli-based corpus annotated for four languages (Dutch, French, German, Spanish) with 590 hours of silver training, 15 hours of silver validation, and 17 hours of gold-standard evaluation data, plus an annotation tool that leverages pre-annotations. It combines filtering, automated pre-annotation, manual refinement, and verification to produce a high-quality multilingual dataset with OntoNotes v5 classes (and a SLUE-style 7-class set), distributed as JSON Lines and on HuggingFace. The authors compare silver and gold annotations and benchmark both pipeline and end-to-end SLU-based NER models, finding that transcription quality largely drives speech-NER performance while end-to-end models can better detect entity presence under noise. Overall, MSNER enables robust benchmarking and cross-language study of Spoken NER, with practical data release and tooling to accelerate future research.

Abstract

While extensively explored in text-based tasks, Named Entity Recognition (NER) remains largely neglected in spoken language understanding. Existing resources are limited to a single, English-only dataset. This paper addresses this gap by introducing MSNER, a freely available, multilingual speech corpus annotated with named entities. It provides annotations to the VoxPopuli dataset in four languages (Dutch, French, German, and Spanish). We also release an efficient annotation tool that leverages automatic pre-annotations for faster manual refinement. This results in 590 and 15 hours of silver-annotated speech for training and validation, alongside a 17-hour, manually annotated evaluation set. We further provide an analysis comparing silver and gold annotations. Finally, we present baseline NER models to stimulate further research on this newly available dataset.

Paper Structure

This paper contains 17 sections, 1 equation, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Annotated example
  • Figure 2: Evaluation of text-based pretrained NER model against our annotations. Bright colors correspond to the F1-score and faded colors correspond to the label-F1 score, a metric that ignores spelling mistakes and segmentation errors.
  • Figure 3: Distribution of predicted probability score per class given the target class for the text-based model's predictions
  • Figure 4: Confusion matrix, normalized to show the probability distribution of the tags predicted with the text-based model.
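
The caption of Figure 2 distinguishes the F1-score from the label-F1 score, which ignores spelling mistakes and segmentation errors. The sketch below illustrates one way such a pair of metrics could be computed. It assumes the SLUE-style convention where F1 is taken over (label, entity text) pairs and label-F1 over labels alone; the summary does not give the exact definition used for MSNER. The function names and the example entities are hypothetical.

```python
from collections import Counter


def f1_from_counters(pred: Counter, gold: Counter) -> float:
    """F1 between two multisets of items (predicted vs. reference)."""
    true_pos = sum((pred & gold).values())  # multiset intersection
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(pred.values())
    recall = true_pos / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


def entity_f1(pred, gold):
    """Strict score: an entity counts only if label AND text both match."""
    return f1_from_counters(Counter(pred), Counter(gold))


def label_f1(pred, gold):
    """Relaxed score (assumed definition): compare labels only, so a
    misspelled or mis-segmented entity text is not penalized."""
    pred_labels = Counter(label for label, _ in pred)
    gold_labels = Counter(label for label, _ in gold)
    return f1_from_counters(pred_labels, gold_labels)


# Hypothetical utterance: the recognizer misspells one entity.
gold = [("PERSON", "angela merkel"), ("GPE", "brussels")]
pred = [("PERSON", "angela merkl"), ("GPE", "brussels")]

print(entity_f1(pred, gold))  # the misspelled PERSON counts as an error
print(label_f1(pred, gold))   # both labels are present, so no error
```

In this toy case the strict score drops because of the transcription error while the relaxed score does not. Under the assumed definition, a gap between the two scores points at transcription quality rather than at entity detection.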