Does Synthetic Data Help Named Entity Recognition for Low-Resource Languages?

Gaurav Kamath, Sowmya Vajjala

TL;DR

This study investigates whether synthetic data generated by multilingual large language models can aid NER for 11 low-resource languages. It introduces a seed-based synthetic data pipeline and compares two NER training regimes—training from scratch and fine-tuning on a related language—against organic data and WikiANN baselines. The findings show that a small amount of manually annotated data typically outperforms large synthetic datasets, though synthetic data can rival or exceed WikiANN in many languages, with substantial language-dependent variation. The work highlights both the promise and the limits of LLM-driven data augmentation for low-resource NER and underlines the need for robust gold-standard benchmarks in multilingual evaluation.

Abstract

Named Entity Recognition (NER) for low-resource languages aims to produce robust systems for languages where there is limited labeled training data available, and has been an area of increasing interest within NLP. Data augmentation for increasing the amount of low-resource labeled data is a common practice. In this paper, we explore the role of synthetic data in the context of multilingual, low-resource NER, considering 11 languages from diverse language families. Our results suggest that synthetic data does in fact hold promise for low-resource language NER, though we see significant variation between languages.

Paper Structure

This paper contains 16 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: High-level overview of our data generation process. We use multilingual large language models to generate new NER data points on the basis of a handful of high-quality, human-labeled data points. See Section \ref{subsec:datagen} for more.
  • Figure 2: NER model performance when trained on increasingly large subsets of training data. aya-expanse-32b and Llama-3.1-8B-Instruct produced smaller amounts of usable data, which is why their curves often do not extend as far as those for organic or GPT-4.1-produced data in fine-tuning data size. In the NER fine-tuning setting, performance at Fine-tuning Training Data Size = 0 indicates the zero-shot performance of a related-language NER model.
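
The data-size sweep described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' code: the helper name `nested_subsets`, the toy sentence lists, and the size grid are all hypothetical, chosen only to show how nested training subsets of increasing size might be built, and why a data source that yields fewer usable examples produces a shorter curve.

```python
import random

def nested_subsets(examples, sizes, seed=0):
    """Return {size: subset}; each smaller subset is a prefix of every larger one.

    Hypothetical helper for illustration only, not taken from the paper.
    """
    rng = random.Random(seed)      # fixed seed so the subsets are reproducible
    shuffled = list(examples)      # copy, so the caller's list is not mutated
    rng.shuffle(shuffled)
    # Cap each requested size at the amount of data actually available.
    capped = sorted({min(s, len(shuffled)) for s in sizes})
    return {n: shuffled[:n] for n in capped}

# Toy stand-ins for annotated sentences (assumed data, not real results).
organic = [f"organic_{i}" for i in range(1000)]
smaller_source = [f"synthetic_{i}" for i in range(300)]  # fewer usable examples

sizes = [0, 100, 500, 1000]  # size 0 corresponds to the zero-shot point
organic_curve = nested_subsets(organic, sizes)
smaller_curve = nested_subsets(smaller_source, sizes)

print(sorted(organic_curve))  # [0, 100, 500, 1000]
print(sorted(smaller_curve))  # [0, 100, 300]
```

In this sketch the second source stops at 300 examples, mirroring the caption's note that some models' curves do not extend as far along the data-size axis. Using prefixes of one shuffled list keeps the subsets nested, so each larger training set contains all the data of the smaller ones.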