Does Synthetic Data Help Named Entity Recognition for Low-Resource Languages?
Gaurav Kamath, Sowmya Vajjala
TL;DR
This study investigates whether synthetic data generated by multilingual large language models can aid NER for 11 low-resource languages. It introduces a seed-based synthetic data pipeline and compares two NER training regimes (training from scratch and fine-tuning on a related language) against organic data and WikiANN baselines. The findings show that a small amount of manually annotated data typically outperforms large synthetic datasets, though synthetic data can rival or exceed WikiANN in many languages, with substantial language-dependent variation. The work highlights both the promise and the limits of LLM-driven data augmentation for low-resource NER and underlines the need for robust gold-standard benchmarks in multilingual evaluation.
Abstract
Named Entity Recognition (NER) for low-resource languages aims to produce robust systems for languages with limited labeled training data, and has been an area of increasing interest within NLP. Data augmentation is a common way to increase the amount of labeled data available for such languages. In this paper, we explore the role of synthetic data in the context of multilingual, low-resource NER, considering 11 languages from diverse language families. Our results suggest that synthetic data does in fact hold promise for low-resource language NER, though we see significant variation between languages.
