Enhancing NER Performance in Low-Resource Pakistani Languages using Cross-Lingual Data Augmentation

Toqeer Ehsan, Thamar Solorio

TL;DR

The paper tackles NER in four low-resource Pakistani languages by introducing a cross-lingual data augmentation framework that blends cluster-based augmentation, EDA-RRAug, and GenerativeAug. It shows that cluster-based augmentation, leveraging unsupervised entity clustering, alignment, and ranking, yields the strongest gains for Shahmukhi and Pashto, while Urdu benefits more from generative augmentation; Sindhi benefits from cross-lingual representations in multilingual settings. The work also probes few-shot learning with causal LLMs, revealing current limitations in low-resource NER. Overall, the findings demonstrate that hybrid augmentation strategies can improve NER in related languages by preserving linguistic plausibility and cross-lingual diversity, though annotation quality and dataset size remain critical factors. The study highlights practical implications for building NER systems in multilingual, culturally nuanced, low-resource contexts and points to future work on improving annotations and scaling cross-lingual augmentation.

Abstract

Named Entity Recognition (NER), a fundamental task in Natural Language Processing (NLP), has advanced significantly for high-resource languages. However, due to a lack of annotated datasets and limited representation in Pre-trained Language Models (PLMs), it remains understudied and challenging for low-resource languages. To address these challenges, we propose a data augmentation technique that generates culturally plausible sentences, and we experiment on four low-resource Pakistani languages: Urdu, Shahmukhi, Sindhi, and Pashto. By fine-tuning multilingual masked Large Language Models (LLMs), our approach demonstrates significant improvements in NER performance for Shahmukhi and Pashto. We further explore the capability of generative LLMs for NER and data augmentation using few-shot learning.

Paper Structure

This paper contains 30 sections, 3 figures, 12 tables.

Figures (3)

  • Figure 1: Examples of clustering-based data augmentation applied to three sample sentences. Entity mentions are shown in orange, blue, and green.
  • Figure 2: Sample Urdu sentences for the analysis of EDA. Named entities are highlighted in bold.
  • Figure 3: The cluster-based data augmentation process, which consists of three phases. The entity clustering phase extracts unsupervised clusters for each entity type, the alignment phase aligns the cluster dictionaries with the source (original) entities, and the ranking phase pairs each source entity mention with its best-ranked candidate. The original dataset is the manually annotated dataset, while the augmented dataset is the updated version produced by the augmentation process.
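
The three phases in the Figure 3 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the cluster dictionaries are hand-written here rather than learned by unsupervised clustering, the string-overlap `similarity` function is a stand-in for whatever ranking score the paper uses, and the names `CLUSTERS`, `align`, `best_candidate`, and `augment` are hypothetical. English tokens and single-token entities are used only to keep the example short.

```python
from difflib import SequenceMatcher

# Hypothetical cluster dictionaries: entity type -> list of mention clusters.
# Assumption: in practice these would come from an unsupervised clustering step.
CLUSTERS = {
    "LOC": [["Lahore", "Karachi", "Peshawar"], ["Pakistan", "India"]],
    "PER": [["Ali", "Ahmed"]],
}

def similarity(a, b):
    """Stand-in ranking score based on string overlap (an assumption)."""
    return SequenceMatcher(None, a, b).ratio()

def align(mention, etype, clusters):
    """Alignment: find the cluster of this entity type that holds the mention."""
    for cluster in clusters.get(etype, []):
        if mention in cluster:
            return cluster
    return []

def best_candidate(mention, cluster):
    """Ranking: choose the most similar other mention in the aligned cluster."""
    candidates = [c for c in cluster if c != mention]
    if not candidates:
        return mention  # nothing to swap in, keep the original mention
    return max(candidates, key=lambda c: similarity(mention, c))

def augment(tokens, tags, clusters):
    """Swap each single-token entity for its best candidate; tags stay as-is."""
    augmented = []
    for token, tag in zip(tokens, tags):
        if tag == "O":
            augmented.append(token)
        else:
            etype = tag.split("-")[-1]  # "B-LOC" -> "LOC"
            augmented.append(best_candidate(token, align(token, etype, clusters)))
    return augmented

tokens = ["Ali", "lives", "in", "Pakistan"]
tags = ["B-PER", "O", "O", "B-LOC"]
print(augment(tokens, tags, CLUSTERS))
```

Because replacements are drawn from a cluster of the same entity type, the tag sequence stays valid for the augmented sentence. Multi-token entities would need span-level replacement, which this sketch leaves out.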