Enhancing NER Performance in Low-Resource Pakistani Languages using Cross-Lingual Data Augmentation
Toqeer Ehsan, Thamar Solorio
TL;DR
The paper tackles NER in four low-resource Pakistani languages by introducing a cross-lingual data augmentation framework that blends cluster-based augmentation, EDA-RRAug, and GenerativeAug. It shows that cluster-based augmentation, leveraging unsupervised entity clustering, alignment, and ranking, yields the strongest gains for Shahmukhi and Pashto, while Urdu benefits more from generative augmentation; Sindhi benefits from cross-lingual representations in multilingual settings. The work also probes few-shot learning with causal LLMs, revealing current limitations in low-resource NER. Overall, the findings demonstrate that hybrid augmentation strategies can improve NER in related languages by preserving linguistic plausibility and cross-lingual diversity, though annotation quality and dataset size remain critical factors. The study highlights practical implications for building NER systems in multilingual, culturally nuanced, low-resource contexts and points to future work on improving annotations and scaling cross-lingual augmentation.
Abstract
Named Entity Recognition (NER), a fundamental task in Natural Language Processing (NLP), has advanced significantly for high-resource languages. However, due to a lack of annotated datasets and limited representation in Pre-trained Language Models (PLMs), it remains understudied and challenging for low-resource languages. To address these challenges, we propose a data augmentation technique that generates culturally plausible sentences, and we evaluate it on four low-resource Pakistani languages: Urdu, Shahmukhi, Sindhi, and Pashto. By fine-tuning multilingual masked Large Language Models (LLMs), our approach demonstrates significant improvements in NER performance for Shahmukhi and Pashto. We further explore the capability of generative LLMs for NER and data augmentation using few-shot learning.
