MSNER: A Multilingual Speech Dataset for Named Entity Recognition
Quentin Meeus, Marie-Francine Moens, Hugo Van hamme
TL;DR
MSNER addresses the lack of multilingual Spoken NER resources by releasing a VoxPopuli-based corpus annotated for four languages (Dutch, French, German, Spanish) with 590 hours of silver training, 15 hours of silver validation, and 17 hours of gold-standard evaluation data, plus an annotation tool that leverages pre-annotations. It combines filtering, automated pre-annotation, manual refinement, and verification to produce a high-quality multilingual dataset with OntoNotes v5 classes (and a SLUE-style 7-class set), distributed as JSON Lines and on HuggingFace. The authors compare silver and gold annotations and benchmark both pipeline and end-to-end SLU-based NER models, finding that transcription quality largely drives speech-NER performance while end-to-end models can better detect entity presence under noise. Overall, MSNER enables robust benchmarking and cross-language study of Spoken NER, with practical data release and tooling to accelerate future research.
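Since the corpus is distributed as JSON Lines, a minimal sketch of reading such a file and tallying entity labels might look as follows. The record schema here is an assumption made for illustration: the field names `text`, `entities`, `label`, `start`, and `end`, and the two sample records, are hypothetical and are not taken from the MSNER release, so they should be checked against the actual files before use.

```python
import io
import json
from collections import Counter


def read_jsonl(stream):
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_labels(records, entity_key="entities", label_key="label"):
    """Count entity labels across records.

    The key names are hypothetical defaults, not the documented MSNER schema.
    """
    counts = Counter()
    for record in records:
        for entity in record.get(entity_key, []):
            counts[entity[label_key]] += 1
    return counts


# Made-up sample records standing in for a real annotation file.
SAMPLE = "\n".join([
    json.dumps({
        "text": "Angela Merkel sprach in Berlin.",
        "entities": [
            {"label": "PERSON", "start": 0, "end": 13},
            {"label": "GPE", "start": 24, "end": 30},
        ],
    }),
    json.dumps({"text": "Merci, Monsieur le Président.", "entities": []}),
])

label_counts = count_labels(read_jsonl(io.StringIO(SAMPLE)))
print(label_counts)
```

For a real file, the `io.StringIO(SAMPLE)` stream would be replaced by an open file handle, with the key names adjusted to whatever the released annotations actually use.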
Abstract
While extensively explored in text-based tasks, Named Entity Recognition (NER) remains largely neglected in spoken language understanding. Existing resources are limited to a single, English-only dataset. This paper addresses this gap by introducing MSNER, a freely available, multilingual speech corpus annotated with named entities. It provides annotations for the VoxPopuli dataset in four languages (Dutch, French, German, and Spanish). We also release an efficient annotation tool that leverages automatic pre-annotations for faster manual refinement. The resulting corpus comprises 590 and 15 hours of silver-annotated speech for training and validation, respectively, alongside a 17-hour, manually annotated evaluation set. We further provide an analysis comparing silver and gold annotations. Finally, we present baseline NER models to stimulate further research on this newly available dataset.
