Do "English" Named Entity Recognizers Work Well on Global Englishes?

Alexander Shan, John Bauer, Riley Carlson, Christopher Manning

TL;DR

The paper examines the bias of English NER models toward American and British English by introducing the Worldwide English NER Dataset and evaluating multiple tools and training regimes across global English varieties. Models trained on CoNLL 2003 or OntoNotes show significant F1 drops (over 10 F1 in some cases) when evaluated on global English data; degradation is largest for Indigenous Oceania and Africa, while Asia and the Middle East fare relatively better. Retraining with the Worldwide dataset improves regional performance, and a combined training approach (Worldwide plus CoNLL 2003 or OntoNotes) loses only 1 to 2 F1 on both test sets, suggesting that diverse, globally sourced training data is key for robust NER across English varieties. The study also compares classic and neural toolkits (CoreNLP, Flair, spaCy, Stanza) and finds that while transformer-based embeddings help, no model fully resolves regional errors without broad, diverse training data. Overall, the work argues for incorporating global English corpora to promote equitable NER performance and calls for further exploration across other languages and dialects.

Abstract

The vast majority of popular English named entity recognition (NER) datasets contain American or British English data, despite the existence of many global varieties of English. As such, it is unclear whether models trained on them generalize to English as it is used globally. To test this, we build a newswire dataset, the Worldwide English NER Dataset, to analyze NER model performance on low-resource English variants from around the world. We test widely used NER toolkits and transformer models, including models using the pre-trained contextual models RoBERTa and ELECTRA, on three datasets: CoNLL 2003, a commonly used British English newswire dataset; OntoNotes, a more American-focused dataset; and our global dataset. All models trained on the CoNLL or OntoNotes datasets experienced significant performance drops (over 10 F1 in some cases) when tested on the Worldwide English dataset. Upon examination of region-specific errors, we observe the greatest performance drops for Oceania and Africa, while Asia and the Middle East had comparatively strong performance. Lastly, we find that a combined model trained on the Worldwide dataset and either CoNLL or OntoNotes lost only 1 to 2 F1 on both test sets.
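
All results above are reported in F1 points. As a minimal sketch, assuming the conventional exact-match, entity-level F1 over BIO-tagged sequences (this page does not spell out the paper's scoring script), the metric can be computed as below. The function names `bio_to_spans` and `entity_f1` and the toy tag sequences are illustrative assumptions, not taken from the paper.

```python
from typing import List, Optional, Set, Tuple

# (start index, end index exclusive, entity type)
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO tag sequence (hypothetical helper)."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" closes any open span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        # A stray "I-" with no open span is leniently treated as a span start.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1: a predicted entity counts only on exact span and type match."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (invented for illustration): the model finds the person, misses the location.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
score = entity_f1(gold, pred)  # precision 1.0, recall 0.5
```

Under this definition, a "drop of 10 F1" means the score (scaled to 0-100) on the Worldwide test set is 10 points lower than the same model's score on its in-domain test set.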

Paper Structure (36 sections, 14 tables)