Leveraging Cross-Lingual Transfer Learning in Spoken Named Entity Recognition Systems
Moncef Benaicha, David Thulke, M. A. Tuğtekin Turan
TL;DR
The paper addresses spoken NER for spoken document retrieval and investigates cross-lingual transfer learning across Dutch, English, and German using both pipeline and End-to-End (E2E) approaches. It uses Wav2Vec2-XLS-R-300M for ASR and XLM-RL for NER, with pseudo-annotated data generated from CoNLL on Common Voice speech. The study explores zero-shot and fine-tuned transfer, finding that E2E models are generally more effective and that cross-lingual transfer, especially German→Dutch, yields meaningful gains under limited target data. Key contributions include a comparison of pipeline versus E2E systems in three languages, a demonstration of cross-lingual transfer benefits, and the use of pseudo-annotations to enable spoken NER research in low-resource contexts. The work has practical implications for spoken document retrieval and multilingual NER: it reduces annotation costs and shows that cross-language transfer is robust.
Abstract
Recent Named Entity Recognition (NER) advancements have significantly enhanced text classification capabilities. This paper focuses on spoken NER, aimed explicitly at spoken document retrieval, an area not widely studied due to the lack of comprehensive datasets for spoken contexts. Additionally, the potential for cross-lingual transfer learning in low-resource situations deserves further investigation. In our study, we applied transfer learning techniques across Dutch, English, and German using both pipeline and End-to-End (E2E) approaches. We employed Wav2Vec2 XLS-R models on custom pseudo-annotated datasets to evaluate the adaptability of cross-lingual systems. Our exploration of different architectural configurations assessed the robustness of these systems in spoken NER. Results showed that the E2E model was superior to the pipeline model, particularly with limited annotation resources. Furthermore, transfer learning from German to Dutch improved performance by 7% over the standalone Dutch E2E system and 4% over the Dutch pipeline model. Our findings highlight the effectiveness of cross-lingual transfer in spoken NER and emphasize the need for additional data collection to improve these systems.
